###### Load Indeed Data and Rename Columns ######
indeed_data <- read.csv("indeed.csv")
colnames(indeed_data) <- c("Date", "job_postings")
###### Plot Original Data using Plotly ######
plot_ly(indeed_data, x = ~as.Date(Date)) %>%
add_lines(y = ~job_postings, name = 'Original Job Postings', line = list(color = 'blue')) %>%
layout(title = "Original Job Postings",
xaxis = list(title = "Date"),
yaxis = list(title = "Job Postings"))
DSAN5600 Lab4: ARIMA and SARIMA Models
ARIMA Model
Class Example
Step 1: Plotting the data
As with any data analysis, the first step is to create a time plot of the data to visually inspect for any irregularities or patterns.
We will use Job Postings on Indeed in the United States as our example dataset.
The plot of job postings on Indeed in the United States from 2020 to 2025 reflects significant economic shifts over this period. A sharp decline in early 2020 corresponds to the onset of the COVID-19 pandemic, followed by a rapid recovery in 2021 as businesses adapted and reopened. Job postings peaked in early 2022, likely driven by post-pandemic economic optimism and increased hiring demand. However, from mid-2022 onward, there was a steady decline, suggesting a cooling labor market potentially influenced by inflation, rising interest rates, and broader economic tightening. By 2024, the decline stabilized, indicating that the job market may have reached a new equilibrium with more consistent, albeit cautious, hiring patterns.
Converting Job Postings to a Time Series Object
To proceed with modeling, it is essential to convert the job postings data into a time series object. This format allows us to apply time series analysis techniques effectively, ensuring that the temporal structure of the data is preserved.
Note: We are currently working with a univariate time series, focusing solely on the number of jobs posted on indeed.com.
###### Convert to Time Series Object ######
ts_indeed <- ts(indeed_data$job_postings,
start = decimal_date(as.Date(min(indeed_data$Date))),
frequency = 365.25) # Daily data
Checking for Stationarity and Serial Correlation
Before fitting an ARIMA model, it’s crucial to determine if the series is stationary—meaning its statistical properties, like mean and variance, remain consistent over time.
Step 2: Make the time series stationary
Let’s go back to our example.
# Plot the Autocorrelation Function (ACF) for Original Data
ggAcf(ts_indeed, 50) +
ggtitle("Autocorrelation Function of Original Job Postings") +
theme_minimal()To assess serial correlation, we use the Autocorrelation Function (ACF) plot. The ACF plot shows clear signs of high serial correlation, indicating that past values significantly influence future values. This indicates differencing is needed to achieve stationarity.
# Lag plot to visualize patterns in Original Data
gglagplot(ts_indeed, do.lines = FALSE) +
ggtitle("Lag Plot of Original Job Postings") +
theme_minimal()
The lag plots show the same strong correlation that is visible in the ACF plot.
Augmented Dickey Fuller (ADF) Test
The null hypothesis of the ADF test is that the time series is non-stationary.
If the p-value of the test is less than the significance level (0.05), we reject the null hypothesis and conclude that the time series is stationary at the 5% significance level.
Conversely, a p-value greater than 0.05 indicates non-stationarity, in which case we difference the data to make the process stationary.
tseries::adf.test(ts_indeed)
Augmented Dickey-Fuller Test
data: ts_indeed
Dickey-Fuller = -1.1083, Lag order = 12, p-value = 0.9213
alternative hypothesis: stationary
The p-value is greater than the 0.05 significance level, indicating that the series is not stationary.
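As a sketch (not part of the original lab), the ADF result can be cross-checked with the KPSS test from the tseries package. Note that its null hypothesis is reversed: the KPSS null is that the series is stationary. Simulated data are used here for illustration.

```r
# KPSS test: null hypothesis = the series IS (level-)stationary,
# the reverse of the ADF test's null.
library(tseries)

set.seed(123)
rw <- cumsum(rnorm(500))   # simulated random walk (non-stationary)
wn <- rnorm(500)           # simulated white noise (stationary)

kpss.test(rw)   # small p-value: reject stationarity
kpss.test(wn)   # large p-value: consistent with stationarity
```

Agreement between the two tests (ADF fails to reject non-stationarity, KPSS rejects stationarity) strengthens the case for differencing.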
Differencing
The purpose of differencing is to transform the time series into a stationary one, where statistical properties remain constant over time.
###### Differencing the Data ######
diff_indeed <- diff(ts_indeed)
# Plot Differenced Original Data
p1 <- ggplot() +
geom_line(aes(x = time(diff_indeed), y = diff_indeed), color = "blue") +
labs(title = "Differenced Original Job Postings",
x = "Time",
y = "Differenced Job Postings") +
theme_minimal()
# Display the Differenced Plot
p1
Order of Differencing
library(patchwork)
###### First and Second Order Differencing ######
# First-order differencing
diff_1 <- diff(ts_indeed)
# Second-order differencing
diff_2 <- diff(ts_indeed, differences = 2)
###### Plot ACF for First and Second Order Differenced Series ######
# ACF plot for first-order differenced series
acf_plot_1 <- ggAcf(diff_1,50) +
ggtitle("ACF of First-Order Differenced Series") +
theme_minimal()
# ACF plot for second-order differenced series
acf_plot_2 <- ggAcf(diff_2,50) +
ggtitle("ACF of Second-Order Differenced Series") +
theme_minimal()
# Display both ACF plots side by side
acf_plot_1 / acf_plot_2
The ACF plots provide valuable insights into the differencing required to achieve stationarity:
- First-Order Differencing:
The ACF of the first-order differenced series shows significant positive autocorrelations that persist across many lags. This indicates that first-order differencing is not sufficient to make the series stationary, as the correlations do not decay quickly to zero.
- Second-Order Differencing:
The ACF of the second-order differenced series shows a much better result. The autocorrelations drop close to zero rapidly, with only a few spikes within the confidence intervals. This suggests that the second-order differencing has effectively removed the trend, making the series closer to stationary.
While the second-order differencing (d = 2) seems to provide a more stationary series, we should be cautious not to over-difference. To comply with the principle of parsimony—which favors simpler models with fewer parameters—we can explore ARIMA models using both d = 1 and d = 2. Comparing model performance across different parameter combinations will help determine whether the additional differencing improves model accuracy or unnecessarily complicates the model.
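As a quick sanity check on the differencing order (an addition to the lab, using simulated data rather than the Indeed series), forecast::ndiffs() estimates the number of differences required via a unit-root test (KPSS by default):

```r
# Estimate the required differencing order d with a unit-root test
library(forecast)

set.seed(42)
trended <- ts(0.5 * (1:300) + rnorm(300, sd = 5))  # trend + noise
ndiffs(trended)        # number of differences suggested by the KPSS test
ndiffs(diff(trended))  # after differencing once, typically no more are needed
```

Running ndiffs() on ts_indeed would give an automated second opinion on whether d = 1 or d = 2 is warranted.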
Step 3: Order of the AR term (p)?
The next step is to determine whether the model requires any AutoRegressive (AR) terms. This can be done by analyzing the Partial Autocorrelation Function (PACF) plot, which helps identify the appropriate number of AR terms to include in the model.
# Plot the Partial Autocorrelation Function (PACF) for Differenced Data
p1<-ggPacf(diff_1,50) +
ggtitle("PACF of First Differenced Job Postings") +
theme_minimal()
p2<-ggPacf(diff_2,50) +
ggtitle("PACF of Second Differenced Job Postings") +
theme_minimal()
p1 / p2
Here we can observe that when d = 1, there are high partial autocorrelations suggesting p = 0, 1, 2, or 3 (the remaining lags are too far out to use as parameters). When d = 2, there are high partial autocorrelations suggesting p = 0, 1, or 2; p = 0 is a choice here since the correlations do not die down.
Step 4: Order of the MA term (q)
To determine the appropriate number of Moving Average (MA) terms for the model, we analyze the Autocorrelation Function (ACF) plot.
# Plot the Autocorrelation Function (ACF) for Differenced Data
p1<-ggAcf(diff_1,50) +
ggtitle("ACF of First Differenced Job Postings") +
theme_minimal()
p2<-ggAcf(diff_2,50) +
ggtitle("ACF of Second Differenced Job Postings") +
theme_minimal()
p1 / p2
Here it is clear that d = 2 provides better stationarity. At d = 2, there is a significant autocorrelation at lag 1, suggesting q = 1 (the remaining lags are too far out to use as parameters).
Step 5: Parameter Tuning
Let’s go through different parameter combinations to find the best model.
# Load necessary libraries
library(knitr)
library(kableExtra)
# Define parameter ranges
p_range <- 0:4
d_range <- 1:2
q_range <- 1 # Single value for q
# Calculate total combinations
n_combinations <- length(p_range) * length(d_range) * length(q_range)
# Create an empty matrix to store results
results_matrix <- matrix(NA, nrow = n_combinations, ncol = 6)
# Initialize index for matrix row
i <- 1
# Loop through combinations of ARIMA model parameters
for (d in d_range) {
for (p in p_range) {
q <- q_range
# Fit ARIMA model with specified (p,d,q)
model <- Arima(ts_indeed, order = c(p, d, q), include.drift = TRUE)
# Store model parameters and AIC/BIC/AICc values in matrix
results_matrix[i, ] <- c(p, d, q, model$aic, model$bic, model$aicc)
# Increment row index
i <- i + 1
}
}
# Convert matrix to data frame
results_df <- as.data.frame(results_matrix)
colnames(results_df) <- c("p", "d", "q", "AIC", "BIC", "AICc")
# Find the row with the lowest AIC
highlight_row <- which.min(results_df$AIC)
# Generate kable table with highlighting for the row with the lowest AIC
knitr::kable(results_df, align = 'c', caption = "Comparison of ARIMA Models") %>%
kable_styling(full_width = FALSE, position = "center") %>%
row_spec(highlight_row, bold = TRUE, background = "#FFFF99") # Highlight row in yellow
| p | d | q | AIC | BIC | AICc |
|---|---|---|---|---|---|
| 0 | 1 | 1 | -1444.427 | -1427.897 | -1444.414 |
| 1 | 1 | 1 | -2889.317 | -2867.278 | -2889.295 |
| 2 | 1 | 1 | -2888.179 | -2860.630 | -2888.146 |
| 3 | 1 | 1 | -2888.354 | -2855.294 | -2888.308 |
| 4 | 1 | 1 | -2886.284 | -2847.715 | -2886.222 |
| 0 | 2 | 1 | -2837.687 | -2826.668 | -2837.680 |
| 1 | 2 | 1 | -2835.773 | -2819.245 | -2835.759 |
| 2 | 2 | 1 | -2879.716 | -2857.679 | -2879.694 |
| 3 | 2 | 1 | -2888.634 | -2861.088 | -2888.601 |
| 4 | 2 | 1 | -2832.020 | -2798.964 | -2831.973 |
The table indicates that the ARIMA(1,1,1) model has the lowest AIC and BIC values.
However, let’s check with auto.arima():
auto.arima(ts_indeed)
Series: ts_indeed
ARIMA(4,2,2)
Coefficients:
ar1 ar2 ar3 ar4 ma1 ma2
-0.7639 -1.1123 -0.4509 -0.2923 0.4395 0.9794
s.e. 0.0227 0.0270 0.0268 0.0227 0.0047 0.0061
sigma^2 = 0.01021: log likelihood = 1595.9
AIC=-3177.8 AICc=-3177.74 BIC=-3139.23
This suggests an ARIMA(4,2,2) model. Therefore, we rerun the parameter search with the grid extended to include q = 2.
# Load necessary libraries
library(knitr)
library(kableExtra)
library(forecast)
# Create an empty matrix to store results
results_matrix <- matrix(rep(NA, 6 * 20), nrow = 20)
# Initialize index for matrix row
i <- 1
# Loop through ARIMA model parameters: d = 1,2; p = 0:4; q = 1,2
for (d in 1:2) {
for (p in 0:4) {
for (q in 1:2) {
# Ensure 'ts_indeed' is a univariate time series by selecting the correct column
model <- Arima(ts_indeed, order = c(p, d, q), include.drift = TRUE)
# Store model parameters and AIC/BIC/AICc values in matrix
results_matrix[i, ] <- c(p, d, q, model$aic, model$bic, model$aicc)
# Increment row index
i <- i + 1
}
}
}
# Convert matrix to data frame
results_df <- as.data.frame(results_matrix)
colnames(results_df) <- c("p", "d", "q", "AIC", "BIC", "AICc")
# Find the row with the minimum AIC
highlight_row <- which.min(results_df$AIC)
# Generate kable table with highlighting for the row with the minimum AIC
knitr::kable(results_df, align = 'c', caption = "Comparison of ARIMA Models") %>%
kable_styling(full_width = FALSE, position = "center") %>%
row_spec(highlight_row, bold = TRUE, background = "#FFFF99") # Highlight row in yellow
| p | d | q | AIC | BIC | AICc |
|---|---|---|---|---|---|
| 0 | 1 | 1 | -1444.427 | -1427.897 | -1444.414 |
| 0 | 1 | 2 | -2072.407 | -2050.367 | -2072.385 |
| 1 | 1 | 1 | -2889.317 | -2867.278 | -2889.295 |
| 1 | 1 | 2 | -2888.391 | -2860.842 | -2888.358 |
| 2 | 1 | 1 | -2888.179 | -2860.630 | -2888.146 |
| 2 | 1 | 2 | -2899.974 | -2866.915 | -2899.928 |
| 3 | 1 | 1 | -2888.354 | -2855.294 | -2888.308 |
| 3 | 1 | 2 | -2886.356 | -2847.787 | -2886.294 |
| 4 | 1 | 1 | -2886.284 | -2847.715 | -2886.222 |
| 4 | 1 | 2 | -3102.517 | -3058.438 | -3102.438 |
| 0 | 2 | 1 | -2837.687 | -2826.668 | -2837.680 |
| 0 | 2 | 2 | -2835.771 | -2819.243 | -2835.758 |
| 1 | 2 | 1 | -2835.773 | -2819.245 | -2835.759 |
| 1 | 2 | 2 | -2833.791 | -2811.753 | -2833.769 |
| 2 | 2 | 1 | -2879.716 | -2857.679 | -2879.694 |
| 2 | 2 | 2 | -2831.824 | -2804.277 | -2831.791 |
| 3 | 2 | 1 | -2888.634 | -2861.088 | -2888.601 |
| 3 | 2 | 2 | -2884.149 | -2851.093 | -2884.103 |
| 4 | 2 | 1 | -2832.020 | -2798.964 | -2831.973 |
| 4 | 2 | 2 | -3177.797 | -3139.231 | -3177.735 |
Indeed, the ARIMA(4,2,2) model has the lowest AIC and BIC values overall, while ARIMA(4,1,2) is the best-performing model among those with d = 1.
Step 6: Model Selection - Model Diagnostics
We have several choices for our models. Let’s check the model diagnostics to find the best model.
The Models chosen by the above analysis:
- ARIMA(1,1,1)
- ARIMA(4,1,2)
- ARIMA(4,2,2)
ARIMA(1,1,1)
# Set seed for reproducibility
set.seed(345)
##### Capture ARIMA model output for diagnostics #####
# Example model diagnostics for ARIMA(1,1,1)
model_output <- capture.output(sarima(ts_indeed, 1, 1, 1))
# Find the line numbers dynamically based on a keyword
start_line <- grep("Coefficients", model_output) # Locate where coefficient details start
end_line <- length(model_output) # Last line of output
# Print the relevant section automatically
cat(model_output[start_line:end_line], sep = "\n")
Coefficients:
Estimate SE t.value p.value
ar1 0.9371 0.0091 102.7188 0.0000
ma1 -0.3222 0.0242 -13.3017 0.0000
constant 0.0057 0.0274 0.2095 0.8341
sigma^2 estimated as 0.01196945 on 1823 degrees of freedom
AIC = -1.58232 AICc = -1.582313 BIC = -1.570251
The Residual Plot shows nearly consistent fluctuation around zero, suggesting that the residuals are nearly stationary with a constant mean and finite variance over time.
The Autocorrelation Function (ACF) of the residuals reveals no significant autocorrelations except around lag 10, which is too distant a lag to incorporate in the model.
The Q-Q Plot indicates that the residuals follow a near-normal distribution, with minor deviations at the tails, which is typical in time series data.
The Ljung-Box Test p-values remain mostly below the 0.05 significance level, implying that some significant autocorrelation remains in the residuals and that model improvements may be necessary.
Coefficient Significance: All model coefficients are significant for the ARIMA(1,1,1) model.
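The Ljung-Box statistic reported by sarima() can also be computed directly with stats::Box.test(). This sketch (not from the lab) uses simulated data rather than the Indeed series:

```r
# Ljung-Box test on ARIMA residuals; the null hypothesis is that the
# residuals show no remaining autocorrelation.
library(forecast)

set.seed(345)
y   <- arima.sim(model = list(ar = 0.9), n = 500)  # simulated AR(1)
fit <- Arima(y, order = c(1, 0, 0))

# fitdf adjusts the degrees of freedom for the p + q estimated parameters
Box.test(residuals(fit), lag = 20, type = "Ljung-Box", fitdf = 1)
```

A large p-value here would indicate the model has captured the serial dependence; small p-values, as seen above, point to remaining structure.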
ARIMA(4,1,2)
# Set seed for reproducibility
set.seed(345)
##### Capture ARIMA(4,1,2) model output for diagnostics #####
# Fit ARIMA(4,1,2) model
model_output <- capture.output(sarima(ts_indeed, 4, 1, 2))
# Find the line numbers dynamically based on a keyword
start_line <- grep("Coefficients", model_output) # Locate where coefficient details start
end_line <- length(model_output) # Last line of output
# Print the relevant section automatically
cat(model_output[start_line:end_line], sep = "\n")
Coefficients:
Estimate SE t.value p.value
ar1 0.2480 0.0235 10.5738 0.0000
ar2 -0.2489 0.0205 -12.1517 0.0000
ar3 0.5513 0.0211 26.1541 0.0000
ar4 0.1946 0.0232 8.3889 0.0000
ma1 0.4427 0.0045 98.1955 0.0000
ma2 0.9893 0.0042 236.8259 0.0000
constant 0.0056 0.0228 0.2432 0.8079
sigma^2 estimated as 0.01057986 on 1819 degrees of freedom
AIC = -1.699078 AICc = -1.699045 BIC = -1.674939
The Residual Plot shows nearly consistent fluctuation around zero, suggesting that the residuals are nearly stationary with a constant mean and finite variance over time.
The Autocorrelation Function (ACF) of the residuals reveals significant autocorrelations around lag 2 and at higher-order lags, which are too distant to incorporate in the model.
The Q-Q Plot indicates that the residuals follow a near-normal distribution, with minor deviations at the tails, which is typical in time series data.
The Ljung-Box Test p-values remain below the 0.05 significance level, implying that some significant autocorrelation remains in the residuals and that model improvements may be necessary.
Coefficient Significance: All model coefficients are significant for the ARIMA(4,1,2) model.
ARIMA(4,2,2)
# Set seed for reproducibility
set.seed(345)
##### Capture ARIMA(4,2,2) model output for diagnostics #####
# Fit ARIMA(4,2,2) model
model_output <- capture.output(sarima(ts_indeed, 4, 2, 2))
# Find the line numbers dynamically based on a keyword
start_line <- grep("Coefficients", model_output) # Locate where coefficient details start
end_line <- length(model_output) # Last line of output
# Print the relevant section automatically
cat(model_output[start_line:end_line], sep = "\n")
Coefficients:
Estimate SE t.value p.value
ar1 -0.7639 0.0227 -33.7260 0
ar2 -1.1123 0.0270 -41.2599 0
ar3 -0.4509 0.0268 -16.8315 0
ar4 -0.2923 0.0227 -12.8834 0
ma1 0.4395 0.0047 92.8622 0
ma2 0.9794 0.0061 161.3321 0
sigma^2 estimated as 0.01016194 on 1819 degrees of freedom
AIC = -1.741258 AICc = -1.741233 BIC = -1.720127
The Residual Plot shows nearly consistent fluctuation around zero (better than for the two models above), suggesting that the residuals are nearly stationary with a constant mean and finite variance over time.
The Autocorrelation Function (ACF) of the residuals still shows some significant autocorrelations at high lags, which are too distant to incorporate in the model.
The Q-Q Plot indicates that the residuals follow a near-normal distribution, with minor deviations at the tails, which is typical in time series data.
The Ljung-Box Test p-values remain below the 0.05 significance level, implying that some significant autocorrelation remains in the residuals and that model improvements may be necessary.
Coefficient Significance: All model coefficients are significant.
Conclusion:
Model diagnostics were similar across all of the above models. Therefore, we will choose the model with the lowest AIC, which is ARIMA(4,2,2).
Model fitting
# Fit ARIMA(4,2,2) model
fit <- Arima(ts_indeed, order = c(4, 2, 2), include.drift = FALSE)
# Display model summary
summary(fit)
Series: ts_indeed
ARIMA(4,2,2)
Coefficients:
ar1 ar2 ar3 ar4 ma1 ma2
-0.7639 -1.1123 -0.4509 -0.2923 0.4395 0.9794
s.e. 0.0227 0.0270 0.0268 0.0227 0.0047 0.0061
sigma^2 = 0.01021: log likelihood = 1595.9
AIC=-3177.8 AICc=-3177.74 BIC=-3139.23
Training set error measures:
ME RMSE MAE MPE MAPE
Training set -9.759566e-05 0.1008055 0.07215598 0.000640841 0.06057821
MASE ACF1
Training set 0.002546013 0.007569145
Model Equation
A process \(x_t\) is said to be \(\operatorname{ARIMA}(p, d, q)\) if
\[ \phi(B) \nabla^d x_t = \theta(B) w_t, \quad \text{where } \nabla^d x_t = (1-B)^d x_t \]
For our fitted ARIMA(4,2,2) model, \(d = 2\) and
\[ \phi(B) = 1 + 0.76 B + 1.11 B^2 + 0.45 B^3 + 0.29 B^4 \]
\[ \theta(B) = 1 + 0.44 B + 0.98 B^2 \]
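One further check on a fitted model (not shown in the lab) is whether the inverse AR and MA roots lie inside the unit circle, i.e. the model is causal and invertible. The forecast package provides an autoplot() method for Arima objects that draws them; simulated data are used in this sketch.

```r
# Plot inverse AR/MA roots of a fitted model; all roots inside the unit
# circle indicate a causal (stationary) and invertible model.
library(forecast)

set.seed(1)
y   <- arima.sim(model = list(ar = c(0.5, -0.3), ma = 0.4), n = 500)
fit <- Arima(y, order = c(2, 0, 1))
autoplot(fit)  # inverse root plot
```

Roots very close to the unit circle can signal over-differencing or near-cancellation of AR and MA terms.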
Step 7: Model Forecasting
Forecasting is the final and most crucial step in time series modeling, as the primary goal of building these models is to generate accurate predictions of future values.
# Forecast the next 365 periods
forecast_result <- forecast(fit, h = 365)
# Display forecast accuracy
accuracy(forecast_result)
ME RMSE MAE MPE MAPE
Training set -9.759566e-05 0.1008055 0.07215598 0.000640841 0.06057821
MASE ACF1
Training set 0.002546013 0.007569145
# Plot the forecast
autoplot(forecast_result) +
labs(title = "ARIMA(4,2,2) Forecast",
x = "Time",
y = "Predicted Values") +
theme_minimal()
The forecast for job postings over the next year (approximately 365 days) suggests a potential decline in the number of new postings. This downward trend could be attributed to seasonal hiring patterns, where many summer and fall positions are typically posted earlier in the year, particularly in January. Another possible explanation could be a broader market trend indicating a slowdown in job availability.
The confidence intervals shown in the shaded blue area represent the range within which the actual job postings are expected to fall. The widening of these intervals as we move further into the forecast period reflects increasing uncertainty. While the central forecast suggests a decline, the broader confidence range indicates that there is still some possibility for fluctuations—either a steeper decline or a stabilization in job postings. This uncertainty highlights the need to monitor the job market closely over the coming months to determine whether this trend is seasonal or indicative of a larger economic shift.
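Beyond inspecting prediction intervals, forecast quality can be judged out of sample with a train/test holdout. This sketch (an addition, using simulated data and forecast's subset() and accuracy() helpers) illustrates the pattern:

```r
# Holdout evaluation: fit on the first 300 observations, score on the rest
library(forecast)

set.seed(7)
y     <- ts(50 + cumsum(rnorm(360)), frequency = 12)
train <- subset(y, end = 300)
test  <- subset(y, start = 301)

fit <- auto.arima(train)
fc  <- forecast(fit, h = length(test))
accuracy(fc, test)  # the "Test set" row gives out-of-sample RMSE/MAE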
Benchmark Methods
For univariate time series forecasting, establishing benchmarks is straightforward. At a minimum, forecasts should be compared against a naive method and a standard approach like an ARIMA model to evaluate the effectiveness of new techniques.
By comparing new forecasting models against these baseline methods, we can better understand whether the proposed techniques provide meaningful improvements.
Let’s go back to our example on the job postings at indeed.com.
# Fit models individually and check residuals
f_mean <- meanf(ts_indeed, h = 200)
head(f_mean$mean) # gives the forecasted values
Time Series:
Start = 2025.08675284164
End = 2025.10044209558
Frequency = 365.25
[1] 123.8298 123.8298 123.8298 123.8298 123.8298 123.8298
mean(ts_indeed)
[1] 123.8298
f_naive <- naive(ts_indeed, h = 200)
#checkresiduals(f_naive)
f_drift <- rwf(ts_indeed, drift = TRUE, h = 200)
checkresiduals(f_drift)
Ljung-Box test
data: Residuals from Random walk with drift
Q* = 19674, df = 365, p-value < 2.2e-16
Model df: 0. Total lags used: 365
# Plot forecasts using Mean, Naïve, Drift methods, and ARIMA fit
autoplot(ts_indeed) +
autolayer(meanf(ts_indeed, h = 365), series = "Mean", PI = FALSE) +
autolayer(naive(ts_indeed, h = 365), series = "Naïve", PI = FALSE) +
autolayer(rwf(ts_indeed, drift = TRUE, h = 365), series = "Drift", PI = FALSE) +
autolayer(forecast(fit, h = 365), series = "ARIMA Fit", PI = FALSE) +
ggtitle("Job Postings Forecast") +
xlab("Date") + ylab("Number of Job Postings") +
guides(colour = guide_legend(title = "Forecast Methods")) +
theme_minimal()
The same forecasts in Plotly, so we can zoom in:
# Generate forecasts
mean_forecast <- meanf(ts_indeed, h = 365)
naive_forecast <- naive(ts_indeed, h = 365)
drift_forecast <- rwf(ts_indeed, drift = TRUE, h = 365)
arima_forecast <- forecast(fit, h = 365)
# Convert forecasts to data frames
mean_df <- data.frame(Date = time(mean_forecast$mean), Mean = as.numeric(mean_forecast$mean))
naive_df <- data.frame(Date = time(naive_forecast$mean), Naive = as.numeric(naive_forecast$mean))
drift_df <- data.frame(Date = time(drift_forecast$mean), Drift = as.numeric(drift_forecast$mean))
arima_df <- data.frame(Date = time(arima_forecast$mean), ARIMA_Fit = as.numeric(arima_forecast$mean))
# Original data
ts_indeed_df <- data.frame(Date = time(ts_indeed), Job_Postings = as.numeric(ts_indeed))
# Create Plotly plot
plot_ly() %>%
add_lines(data = ts_indeed_df, x = ~Date, y = ~Job_Postings, name = 'Original Data', line = list(color = 'black')) %>%
add_lines(data = mean_df, x = ~Date, y = ~Mean, name = 'Mean Forecast', line = list(color = 'blue')) %>%
add_lines(data = naive_df, x = ~Date, y = ~Naive, name = 'Naïve Forecast', line = list(color = 'red')) %>%
add_lines(data = drift_df, x = ~Date, y = ~Drift, name = 'Drift Forecast', line = list(color = 'green')) %>%
add_lines(data = arima_df, x = ~Date, y = ~ARIMA_Fit, name = 'ARIMA Fit', line = list(color = 'purple')) %>%
layout(title = 'Job Postings Forecast',
xaxis = list(title = 'Date'),
yaxis = list(title = 'Number of Job Postings'),
legend = list(title = list(text = 'Forecast Methods')))
By zooming in, it is clear that the ARIMA model performs better than the basic benchmark methods.
fpp3 package
The fpp3 package has a newer tool for fitting models automatically, similar to auto.arima(), but the model-selection procedure in fpp3 is different.
# Load necessary libraries
library(tsibble)
library(fable)
library(feasts)
library(dplyr)
# Ensure 'Date' is in date format
indeed_data$Date <- as.Date(indeed_data$Date)
# Convert to tsibble
ts_indeed2 <- indeed_data %>%
as_tsibble(index = Date)
# Fit ARIMA model
fit <- ts_indeed2 %>%
model(ARIMA(job_postings))
# Display model summary
report(fit)
Series: job_postings
Model: ARIMA(2,2,2)(1,0,1)[7]
Coefficients:
ar1 ar2 ma1 ma2 sar1 sma1
1.2927 -0.3257 -1.5437 0.6053 -0.0245 -0.9126
s.e. 0.0792 0.0781 0.0679 0.0660 0.0291 0.0116
sigma^2 estimated as 0.007068: log likelihood=1927.94
AIC=-3841.87 AICc=-3841.81 BIC=-3803.31
This is a seasonal model, which could be a better fit given that the data are daily and the source indicated a weekly (7-day) frequency. We will revisit this in the next chapter.
Check the model diagnostics:
# Set seed for reproducibility
set.seed(345)
##### Capture ARIMA(2,2,2)(1,0,1)[7] model output for diagnostics #####
# Fit seasonal ARIMA model
model_output <- capture.output(sarima(ts_indeed, 2, 2, 2, 1, 0, 1, 7))
# Find the line numbers dynamically based on a keyword
start_line <- grep("Coefficients", model_output) # Locate where coefficient details start
end_line <- length(model_output) # Last line of output
# Print the relevant section automatically
cat(model_output[start_line:end_line], sep = "\n")
Coefficients:
Estimate SE t.value p.value
ar1 1.2927 0.0792 16.3193 0.0000
ar2 -0.3257 0.0781 -4.1686 0.0000
ma1 -1.5437 0.0679 -22.7379 0.0000
ma2 0.6053 0.0660 9.1692 0.0000
sar1 -0.0245 0.0291 -0.8422 0.3998
sma1 -0.9126 0.0116 -78.4846 0.0000
sigma^2 estimated as 0.007033595 on 1819 degrees of freedom
AIC = -2.105135 AICc = -2.105109 BIC = -2.084003
The diagnostics still do not differ greatly, and the seasonal ar1 coefficient is not significant here. However, this model has a lower AIC value than the ARIMA(4,2,2) model.
# Load necessary library
library(forecast)
# Fit ARIMA(2,2,2)(1,0,1)[7] model
fit <- Arima(ts_indeed, order = c(2, 2, 2), seasonal = list(order = c(1, 0, 1), period = 7), include.drift = FALSE)
# Display model summary
summary(fit)
Series: ts_indeed
ARIMA(2,2,2)(1,0,1)[7]
Coefficients:
ar1 ar2 ma1 ma2 sar1 sma1
1.2927 -0.3257 -1.5437 0.6053 -0.0245 -0.9126
s.e. 0.0792 0.0781 0.0679 0.0660 0.0291 0.0116
sigma^2 = 0.007068: log likelihood = 1927.94
AIC=-3841.87 AICc=-3841.81 BIC=-3803.31
Training set error measures:
ME RMSE MAE MPE MAPE
Training set 0.0001746875 0.08388585 0.05719408 0.001490598 0.04855488
MASE ACF1
Training set 0.002018085 -0.01442729
# Generate the forecast
forecast_result <- forecast(fit, h = 365)
# Plot the forecast
autoplot(forecast_result) +
labs(title = "ARIMA(2,2,2)(1,0,1)[7] Forecast",
x = "Time",
y = "Predicted Job Postings") +
theme_minimal()
Prophet Model
Prophet is a procedure for forecasting time series data based on an additive model where non-linear trends are fit with yearly, weekly, and daily seasonality, plus holiday effects. It works best with time series that have strong seasonal effects and several seasons of historical data. Prophet is robust to missing data and shifts in the trend, and typically handles outliers well.
Prophet is open source software released by Facebook’s Core Data Science team. It is available for download on CRAN and PyPI.
In R
# Load required libraries
library(prophet)
# Prepare the dataset for Prophet
job_postings_data <- indeed_data %>%
select(Date, job_postings) %>%
rename(ds = Date, y = job_postings)
# Initialize and fit the Prophet model
prophet_model <- prophet(job_postings_data)
# Extend the forecast horizon by 365 days
future_dates <- make_future_dataframe(prophet_model, periods = 365)
# Generate future predictions
predictions <- predict(prophet_model, future_dates)
# Visualize the forecast
plot(prophet_model, predictions)
This forecast looks better than the simple ARIMA or SARIMA models. We will discuss more examples later.
Financial Time Series: Apple Stock prices
We will use Apple stock prices as our example dataset.
###### Load Apple Stock Data from Yahoo Finance ######
AAPL <- getSymbols("AAPL", auto.assign = FALSE, from = "1980-01-01", src = "yahoo")
###### Convert to Data Frame and Add Date Column ######
AAPL <- data.frame(AAPL)
AAPL <- data.frame(AAPL, rownames(AAPL))
###### Rename the Date Column ######
colnames(AAPL)[7] <- "date"
###### Convert Date Column to Date Format ######
AAPL$date <- as.Date(AAPL$date, "%Y-%m-%d")
###### Create Data Frame with Original and Log-Transformed Adjusted Close Prices ######
data <- data.frame(AAPL$date, AAPL$AAPL.Adjusted, log(AAPL$AAPL.Adjusted))
colnames(data) <- c("date", "AAPL", "log.AAPL")
###### Plot Original and Log-Transformed Adjusted Close Prices ######
ggplot(data) +
geom_line(aes(x = date, y = AAPL, color = "Adjusted Close")) +
geom_line(aes(x = date, y = log.AAPL, color = "Log Adjusted Close")) +
labs(title = "Apple Stock Prices (Adjusted & Log-Transformed)",
x = "Date",
y = "Price",
color = "Legend") +
theme_minimal()
AAPL is the ticker symbol for Apple Inc.’s stock, and we have retrieved its historical prices from Yahoo Finance.
Upon plotting the data, we observe heteroscedasticity, where the variability increases over time. This non-constant variance can negatively impact model accuracy. To mitigate this, we apply a log transformation, which helps stabilize the variance.
The log-transformed data reduces heteroscedasticity, providing a more consistent trend and making it more suitable for reliable model fitting.
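The suitability of the log transform can be checked with forecast::BoxCox.lambda(), which estimates a Box-Cox parameter; an estimate near 0 supports using the log. This sketch (an addition to the lab) uses a simulated series whose variance grows with its level:

```r
# Estimate a Box-Cox transformation parameter; lambda = 0 is the log
library(forecast)

set.seed(11)
y <- ts(exp(cumsum(rnorm(300, mean = 0.01, sd = 0.05))))  # multiplicative growth
BoxCox.lambda(y)             # estimated lambda (near 0 suggests a log transform)
plot(BoxCox(y, lambda = 0))  # equivalent to plotting log(y)
```

Running BoxCox.lambda() on ts_AAPL would similarly quantify whether the log transform is appropriate for the stock series.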
###### Convert to Time Series Object ######
ts_AAPL <- ts(data$AAPL,
start = decimal_date(as.Date("1980-12-12")),
frequency = 252) # Approx. number of trading days in a year
# Plot the Autocorrelation Function (ACF)
ggAcf(ts_AAPL,50) +
ggtitle("Autocorrelation Function of Apple Stock Prices") +
theme_minimal()
To assess serial correlation, we again use the ACF plot. It shows clear signs of strong serial correlation, so differencing is needed to achieve stationarity.
# Lag plot to visualize patterns at different lags
gglagplot(ts_AAPL, do.lines = FALSE) +
ggtitle("Lag Plot of Apple Stock Prices") +
theme_minimal()
# Plot the Time Series with Moving Averages
autoplot(ts_AAPL, series = "Apple Adjusted Close") +
autolayer(ma(ts_AAPL, 200), series = "200-Day Moving Average") +
autolayer(ma(ts_AAPL, 500), series = "500-Day Moving Average") +
labs(title = "Apple Stock Prices with Moving Averages",
x = "Year",
y = "Adjusted Close Price",
color = "Legend") +
theme_minimal()
Applying a moving average with a long window helps smooth out short-term fluctuations, making long-term trends and patterns more visible. In this case, the moving averages highlight a substantial increase in Apple’s stock prices, particularly after the year 2000.
Log transformation
###### Load Required Libraries ######
library(patchwork)
###### Create Data Frame with Original and Log-Transformed Adjusted Close Prices ######
data <- data.frame(AAPL$date, AAPL$AAPL.Adjusted, log(AAPL$AAPL.Adjusted))
colnames(data) <- c("date", "AAPL", "log.AAPL")
###### Convert to Time Series Object ######
ts_AAPL <- ts(data$AAPL,
start = decimal_date(as.Date(min(data$date))),
frequency = 12) # Note: this treats the series as monthly, although the trading data are daily
ts_log_AAPL <- ts(data$log.AAPL,
start = decimal_date(as.Date(min(data$date))),
frequency = 12)
###### Differencing the Data ######
diff_AAPL <- diff(ts_AAPL)
diff_log_AAPL <- diff(ts_log_AAPL)
# Create plots for differenced data
p1 <- ggplot() +
geom_line(aes(x = time(diff_AAPL), y = diff_AAPL), color = "blue") +
labs(title = "Differenced Apple Stock Prices",
x = "Time",
y = "Differenced Prices") +
theme_minimal()
p2 <- ggplot() +
geom_line(aes(x = time(diff_log_AAPL), y = diff_log_AAPL), color = "red") +
labs(title = "Differenced Log-Transformed Apple Stock Prices",
x = "Time",
y = "Differenced Log Prices") +
theme_minimal()
# Combine plots using patchwork
p1 / p2
It is clear that the data need to be transformed before fitting a model.
Differencing - parameter selection
###### Differencing the Log-Transformed Data ######
diff_log_AAPL <- diff(ts_log_AAPL)
diff2_log_AAPL <- diff(ts_log_AAPL, differences = 2)
# ACF Plots for Differenced Log-Transformed Data
acf_plot1 <- ggAcf(diff_log_AAPL) +
labs(title = "ACF of First Differenced Log-Transformed Apple Stock Prices") +
theme_minimal()
acf_plot2 <- ggAcf(diff2_log_AAPL) +
labs(title = "ACF of Second Differenced Log-Transformed Apple Stock Prices") +
theme_minimal()
# Combine ACF Plots using patchwork
acf_plot1 / acf_plot2
First-order differencing is sufficient.
###### ACF and PACF Plots for First Differenced Log-Transformed Data ######
acf_plot <- ggAcf(diff_log_AAPL) +
labs(title = "ACF of First Differenced Log-Transformed Apple Stock Prices") +
theme_minimal()
pacf_plot <- ggPacf(diff_log_AAPL) +
labs(title = "PACF of First Differenced Log-Transformed Apple Stock Prices") +
theme_minimal()
# Combine ACF and PACF Plots using patchwork
acf_plot / pacf_plot

From these plots, the candidate orders are d = 1, p = 2:4, q = 2:4.
###### Load Necessary Libraries ######
library(knitr)
library(kableExtra)
library(forecast)
###### Create an Empty Matrix to Store Results ######
results_matrix <- matrix(rep(NA, 6 * 9), nrow = 9)
###### Initialize Index for Matrix Row ######
i <- 1
###### Loop through ARIMA Model Parameters: d = 1; p = 2:4; q = 2:4 ######
for (p in 2:4) {
for (q in 2:4) {
# Fit ARIMA model on the log series (d = 1 in the order handles the differencing)
model <- Arima(ts_log_AAPL, order = c(p, 1, q), include.drift = TRUE)
# Store model parameters and AIC/BIC/AICc values in matrix
results_matrix[i, ] <- c(p, 1, q, model$aic, model$bic, model$aicc)
# Increment row index
i <- i + 1
}
}
###### Convert Matrix to Data Frame ######
results_df <- as.data.frame(results_matrix)
colnames(results_df) <- c("p", "d", "q", "AIC", "BIC", "AICc")
###### Find the Row with the Minimum AIC ######
highlight_row <- which.min(results_df$AIC)
###### Generate kable Table with Highlighting for the Row with the Minimum AIC ######
knitr::kable(results_df, align = 'c', caption = "Comparison of ARIMA Models") %>%
kable_styling(full_width = FALSE, position = "center") %>%
row_spec(highlight_row, bold = TRUE, background = "#FFFF99") # Highlight row in yellow

| p | d | q | AIC | BIC | AICc |
|---|---|---|---|---|---|
| 2 | 1 | 2 | -47840.78 | -47796.87 | -47840.77 |
| 2 | 1 | 3 | -47849.91 | -47798.69 | -47849.90 |
| 2 | 1 | 4 | -47856.66 | -47798.13 | -47856.65 |
| 3 | 1 | 2 | -47857.65 | -47806.43 | -47857.64 |
| 3 | 1 | 3 | -47856.15 | -47797.61 | -47856.13 |
| 3 | 1 | 4 | -47855.38 | -47789.52 | -47855.36 |
| 4 | 1 | 2 | -47852.38 | -47793.85 | -47852.37 |
| 4 | 1 | 3 | -47854.87 | -47789.01 | -47854.85 |
| 4 | 1 | 4 | -47839.01 | -47765.84 | -47838.99 |
ARIMA(3,1,2) attains the lowest AIC in the table, so it is a good candidate model.
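For reference, the grid search above can be reproduced with base R's arima() alone. The sketch below runs it on a simulated ARMA(2,2) series (a toy assumption, not the AAPL data) so the true order is known.

```r
# Minimal AIC grid search with base R's arima() on simulated ARMA(2,2) data.
set.seed(123)
sim <- arima.sim(model = list(ar = c(0.5, -0.3), ma = c(0.4, 0.2)), n = 600)

grid <- expand.grid(p = 0:3, q = 0:3)
grid$AIC <- mapply(function(p, q) {
  # Some orders may fail to converge; record NA and move on
  tryCatch(AIC(arima(sim, order = c(p, 0, q), method = "ML")),
           error = function(e) NA_real_)
}, grid$p, grid$q)

best <- grid[which.min(grid$AIC), ]  # which.min() skips NA entries
print(best)
```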
Let’s check auto.arima()
auto.arima(ts_log_AAPL)

Series: ts_log_AAPL
ARIMA(4,1,0)(1,0,0)[12] with drift
Coefficients:
ar1 ar2 ar3 ar4 sar1 drift
0.0159 -0.0192 -0.0303 0.0271 0.0374 7e-04
s.e. 0.0095 0.0095 0.0095 0.0095 0.0095 3e-04
sigma^2 = 0.0007917: log likelihood = 23951.77
AIC=-47889.54 AICc=-47889.53 BIC=-47838.32
This suggests ARIMA(4,1,0)(1,0,0)[12] with drift.
Let’s check fpp3 package:
# Load necessary libraries
library(tsibble)
library(fable)
library(feasts)
library(dplyr)
###### Prepare the Log-Transformed Apple Data ######
# Ensure 'Date' is in date format
AAPL$date <- as.Date(AAPL$date)
# Create a data frame with log-transformed adjusted close prices
apple_data <- AAPL %>%
select(date, AAPL.Adjusted) %>%
mutate(log_AAPL = log(AAPL.Adjusted))
# Convert to tsibble and fill implicit gaps
ts_apple <- apple_data %>%
as_tsibble(index = date) %>%
fill_gaps()
###### Fit ARIMA Model on Log-Transformed Data ######
fit <- ts_apple %>%
model(ARIMA(log_AAPL))
###### Display Model Summary ######
report(fit)

Series: log_AAPL
Model: ARIMA(0,0,0) w/ mean
Coefficients:
constant
0.5543
s.e. 0.0241
sigma^2 estimated as 4.446: log likelihood=-26163.24
AIC=52330.47 AICc=52330.47 BIC=52345.85
This returns a constant-only model, which is clearly inadequate; a likely culprit is that fill_gaps() inserts missing values for non-trading days, which changes what ARIMA() selects.
Model Comparison
ARIMA(3,1,2)
# Set seed for reproducibility
set.seed(345)
##### Capture ARIMA(3,1,2) model output for diagnostics #####
# Fit ARIMA(3,1,2) model
model_output <- capture.output(sarima(ts_log_AAPL, 3, 1, 2))
# Find the line numbers dynamically based on a keyword
start_line <- grep("Coefficients", model_output) # Locate where coefficient details start
end_line <- length(model_output) # Last line of output
# Print the relevant section automatically
cat(model_output[start_line:end_line], sep = "\n")

Coefficients:
Estimate SE t.value p.value
ar1 -0.4991 0.2213 -2.2555 0.0241
ar2 -0.1500 0.1616 -0.9283 0.3533
ar3 -0.0408 0.0101 -4.0203 0.0001
ma1 0.5154 0.2214 2.3281 0.0199
ma2 0.1381 0.1622 0.8514 0.3946
constant 0.0007 0.0003 2.6360 0.0084
sigma^2 estimated as 0.0007924691 on 11124 degrees of freedom
AIC = -4.301222 AICc = -4.301221 BIC = -4.29662
The Residual Plot shows nearly consistent fluctuation around zero, but some volatility clustering is visible, suggesting that a volatility model (e.g., ARCH/GARCH) would suit this data better.
The Autocorrelation Function (ACF) of the residuals shows that the residuals are mostly uncorrelated.
The Q-Q Plot indicates that the residuals follow a roughly normal distribution, with minor deviations at the tails, which suggests the model could still be improved.
The Ljung-Box Test p-values remain mostly below the 0.05 significance level, implying significant autocorrelation remains in the residuals, so model improvements may be necessary.
Coefficient Significance: the ar2 and ma2 coefficients are not significant (p = 0.35 and 0.39).
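The Ljung-Box check used in these diagnostics is available in base R as Box.test(); a minimal sketch on simulated AR(1) data (an assumed toy series, not the AAPL residuals) follows, with fitdf adjusting the degrees of freedom for the one estimated coefficient.

```r
# Ljung-Box residual check with base R only (simulated AR(1) toy data).
set.seed(7)
x <- arima.sim(model = list(ar = 0.6), n = 300)
fit <- arima(x, order = c(1, 0, 0))

# fitdf = number of estimated ARMA coefficients (here 1, the AR term)
lb <- Box.test(residuals(fit), lag = 10, type = "Ljung-Box", fitdf = 1)
print(lb$p.value)
```

A small p-value (below 0.05) signals remaining residual autocorrelation, as seen for the model above.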
ARIMA(4,0,0)(1,0,0)[12]
###### Set Seed for Reproducibility ######
set.seed(345)
###### Capture ARIMA Model Output for Diagnostics ######
# Model diagnostics for ARIMA(4,0,0)x(1,0,0)[12]
model_output <- capture.output(sarima(ts_log_AAPL, p = 4, d = 0, q = 0, P = 1, D = 0, Q = 0, S = 12))
# Find the line numbers dynamically based on a keyword
start_line <- grep("Coefficients", model_output) # Locate where coefficient details start
end_line <- length(model_output) # Last line of output
# Print the relevant section automatically
cat(model_output[start_line:end_line], sep = "\n")

This produces the error below: with d = 0 specified, the trending log series leaves a non-stationary AR part, so this specification cannot be fit to the data:
“Error in arima(xdata, order = c(p, d, q), seasonal = list(order = c(P, : non-stationary AR part from CSS”
Model Forecast
So far, ARIMA(3,1,2) is the better model.
fit <- Arima(ts_log_AAPL, order = c(3, 1, 2), include.drift = FALSE)
# Display model summary
summary(fit)

Series: ts_log_AAPL
ARIMA(3,1,2)
Coefficients:
ar1 ar2 ar3 ma1 ma2
-0.5074 -0.1566 -0.0401 0.5244 0.1456
s.e. 0.2193 0.1608 0.0102 0.2193 0.1617
sigma^2 = 0.0007933: log likelihood = 23939.76
AIC=-47867.53 AICc=-47867.52 BIC=-47823.62
Training set error measures:
ME RMSE MAE MPE MAPE MASE
Training set 0.0007096621 0.02815851 0.0190339 -0.08795954 2.423098 0.270375
ACF1
Training set -0.0004307023
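Before writing out the model equation, it is worth checking that the fitted coefficients define a stationary and invertible model. Base R's polyroot() does this by finding the roots of the AR and MA polynomials; the coefficients below are taken from the summary above, sign-flipped per R's parameterization.

```r
# Stationarity/invertibility check for the fitted ARIMA(3,1,2) via polyroot().
# R parameterizes phi(B) = 1 - ar1*B - ar2*B^2 - ar3*B^3, so with
# ar = (-0.5074, -0.1566, -0.0401):
phi   <- c(1, 0.5074, 0.1566, 0.0401)  # 1 + 0.5074 B + 0.1566 B^2 + 0.0401 B^3
theta <- c(1, 0.5244, 0.1456)          # 1 + 0.5244 B + 0.1456 B^2

ar_root_mods <- Mod(polyroot(phi))
ma_root_mods <- Mod(polyroot(theta))

cat("AR root moduli:", round(ar_root_mods, 3), "\n")
cat("MA root moduli:", round(ma_root_mods, 3), "\n")
# All moduli > 1 => causal (stationary) AR part and invertible MA part
```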
Model Equation
A process \(x_t\) is said to be \(\operatorname{ARIMA}(p, d, q)\) if

\[ \phi(B)\,\nabla^d x_t = \theta(B)\, w_t, \qquad \text{where } \nabla^d x_t = (1-B)^d x_t. \]

For the fitted ARIMA(3,1,2),

\[ \phi(B) = 1 + 0.5074\,B + 0.1566\,B^2 + 0.0401\,B^3 \]

\[ \theta(B) = 1 + 0.5244\,B + 0.1456\,B^2 \]

Model Forecast
# Forecast the next 365 periods
forecast_result <- forecast(fit, h = 365)
# Display forecast accuracy
accuracy(forecast_result)

ME RMSE MAE MPE MAPE MASE
Training set 0.0007096621 0.02815851 0.0190339 -0.08795954 2.423098 0.270375
ACF1
Training set -0.0004307023
# Plot the forecast
autoplot(forecast_result) +
labs(title = "ARIMA(3,1,2) Forecast",
x = "Time",
y = "Predicted Values") +
theme_minimal()

###### Plot Forecasts Using Mean, Naïve, Drift Methods, and ARIMA Fit ######
autoplot(ts_log_AAPL) +
autolayer(meanf(ts_log_AAPL, h = 365), series = "Mean", PI = FALSE) +
autolayer(naive(ts_log_AAPL, h = 365), series = "Naïve", PI = FALSE) +
autolayer(rwf(ts_log_AAPL, drift = TRUE, h = 365), series = "Drift", PI = FALSE) +
autolayer(forecast(fit, h = 365), series = "ARIMA Fit", PI = FALSE) +
ggtitle("Apple Stock Price Forecast (Log-Transformed)") +
xlab("Date") + ylab("Log Adjusted Close Price") +
guides(colour = guide_legend(title = "Forecast Methods")) +
theme_minimal()

The drift method captures the trend better than the ARIMA model, which indicates that this ARIMA model is not sufficient. That is expected: this is financial data, and an ARCH/GARCH model should be fitted instead.
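A quick numeric check for ARCH effects (the McLeod-Li idea) is to run the Ljung-Box test on the squared residuals: volatility clustering shows up as autocorrelation in the squares even when the levels look like white noise. The sketch below uses a simulated ARCH(1) series, an assumption for illustration, not the AAPL residuals.

```r
# McLeod-Li style check for ARCH effects on simulated ARCH(1) data (toy series).
set.seed(99)
n <- 1000
z <- rnorm(n)
e <- numeric(n); e[1] <- z[1]
for (t in 2:n) {
  sigma2 <- 0.2 + 0.5 * e[t - 1]^2  # ARCH(1) conditional variance
  e[t] <- sqrt(sigma2) * z[t]
}

p_levels  <- Box.test(e,   lag = 10, type = "Ljung-Box")$p.value
p_squares <- Box.test(e^2, lag = 10, type = "Ljung-Box")$p.value
cat("levels p-value: ", p_levels, "\n")   # typically large: levels uncorrelated
cat("squares p-value:", p_squares, "\n")  # tiny: the variance is autocorrelated
```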
SARIMA Model
\[ARIMA(p,d,q)X(P,D,Q)_s\]
Class Example
We will use Egg Prices as our example dataset.
###### Load Egg Price Data and Rename Columns ######
eggs_data <- read.csv("eggs_price_2025.csv")
colnames(eggs_data) <- c("Date", "egg_prices")
###### Plot Original Data using Plotly ######
plot_ly(eggs_data, x = ~as.Date(Date), y = ~egg_prices, type = 'scatter', mode = 'lines',
line = list(color = 'blue')) %>%
layout(title = "Original Egg Prices",
xaxis = list(title = "Date"),
yaxis = list(title = "Egg Prices"))

Upon plotting the data, we observe heteroscedasticity, where the variability increases over time. This non-constant variance can negatively impact model accuracy. To mitigate this, we apply a log transformation, which helps stabilize the variance.
The log-transformed data reduces heteroscedasticity, providing a more consistent trend and making it more suitable for reliable model fitting.
To confirm, let’s compare the logged-transformed data and the differenced data.
###### Plot Original and Log-Transformed Data using Plotly ######
eggs_data$log_egg_prices <- log(eggs_data$egg_prices)
plot_ly(eggs_data, x = ~as.Date(Date)) %>%
add_lines(y = ~egg_prices, name = 'Original Prices', line = list(color = 'blue')) %>%
add_lines(y = ~log_egg_prices, name = 'Log-Transformed Prices', line = list(color = 'red')) %>%
layout(title = "Original vs Log-Transformed Egg Prices",
xaxis = list(title = "Date"),
yaxis = list(title = "Price"))

Note: We are currently working with a univariate time series, focusing solely on egg prices.
library(patchwork)
###### Convert to Time Series Object ######
ts_eggs <- ts(eggs_data$egg_prices,
start = decimal_date(as.Date(min(eggs_data$Date))),
frequency = 12) # Assuming monthly data
ts_log_eggs <- ts(eggs_data$log_egg_prices,
start = decimal_date(as.Date(min(eggs_data$Date))),
frequency = 12)
###### Differencing the Data ######
diff_eggs <- diff(ts_eggs)
diff_log_eggs <- diff(ts_log_eggs)
# Create plots for differenced data
p1 <- ggplot() +
geom_line(aes(x = time(diff_eggs), y = diff_eggs), color = "blue") +
labs(title = "Differenced Egg Prices",
x = "Time",
y = "Differenced Prices") +
theme_minimal()
p2 <- ggplot() +
geom_line(aes(x = time(diff_log_eggs), y = diff_log_eggs), color = "red") +
labs(title = "Differenced Log-Transformed Egg Prices",
x = "Time",
y = "Differenced Log Prices") +
theme_minimal()
# Combine plots using patchwork
p1 / p2

It is clear from the plots above that the log-transformed data should be used.
Seasonality
Check the seasonality:
###### Decompose the Time Series and Plot ######
autoplot(decompose(ts_log_eggs)) +
ggtitle("Decomposition of Log Transformed Egg Prices") +
xlab("Date") +
ylab("Value")

There is a clear seasonal component here.
Moreover, check the lag plots for seasonal correlation at lags 12, 24, 36, 48.
###### Plot Lag Plot using gglagplot ######
gglagplot(ts_log_eggs, do.lines = FALSE, set.lags = c(12, 24, 36, 48)) +
ggtitle("Lag Plot of Log Transformed Egg Prices")

There appears to be correlation at seasonal lags 1 and 2 (i.e., lags 12 and 24).
###### Plot ACF of Log Transformed Egg Prices using ggAcf ######
ggAcf(ts_log_eggs,50) +
ggtitle("ACF of Log Transformed Egg Prices")

However, the above ACF plot makes it a little hard to identify seasonal correlations. This difficulty may stem from the strong autocorrelation caused by the underlying trend in the data.
Differencing the data
Ordinary Differencing
###### First and Second Order Differencing for Log Egg Prices ######
# First-order differencing
diff_log_1 <- diff(ts_log_eggs)
# Second-order differencing
diff_log_2 <- diff(ts_log_eggs, differences = 2)
###### Plot ACF for First and Second Order Differenced Log Series ######
# ACF plot for first-order differenced log series
acf_log_plot_1 <- ggAcf(diff_log_1, 50) +
ggtitle("ACF of First-Order Differenced Log Series") +
theme_minimal()
# ACF plot for second-order differenced log series
acf_log_plot_2 <- ggAcf(diff_log_2, 50) +
ggtitle("ACF of Second-Order Differenced Log Series") +
theme_minimal()
# Display both ACF plots side by side
acf_log_plot_1 / acf_log_plot_2

The above plots clearly indicate that first-order differencing is sufficient, while second-order differencing results in over-differencing. Additionally, strong seasonal correlations are evident at seasonal lags of 12, 24, 36, and so on.
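The seasonal spikes described above can be reproduced on a simulated monthly series (toy data, not the egg prices): after first differencing removes the trend, the ACF still peaks at the seasonal lags 12, 24, 36.

```r
# Seasonal ACF spikes on a simulated monthly series (toy data).
set.seed(5)
months <- 1:240
x <- ts(2 * sin(2 * pi * months / 12) + 0.02 * months + rnorm(240, sd = 0.5),
        frequency = 12)

a <- acf(diff(x), lag.max = 48, plot = FALSE)
lag_in_months <- round(a$lag * 12)   # acf() reports lags in years for a monthly ts
seasonal_peaks <- a$acf[lag_in_months %in% c(12, 24, 36)]
round(seasonal_peaks, 2)             # all clearly positive
```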
Seasonal Differencing
###### Apply Only Seasonal Differencing (Lag = 12) ######
seasonal_only_diff <- diff(ts_log_eggs, lag = 12)
###### Plot ACF for the Only Seasonally Differenced Series ######
ggAcf(seasonal_only_diff) +
ggtitle("ACF of Only Seasonally Differenced Log Egg Prices") +
theme_minimal()

Seasonal differencing alone is insufficient, as significant autocorrelation from the underlying trend components remains in the series.
Both Ordinary Differencing and Seasonal differencing
###### Apply First-Order Differencing ######
first_diff <- diff(ts_log_eggs)
###### Apply Seasonal Differencing (Lag = 12) on Top of First-Order Differencing ######
seasonal_diff <- diff(first_diff, lag = 12)
###### Plot ACF and PACF for the Differenced Series ######
ggAcf(seasonal_diff) +
ggtitle("ACF of Seasonally Differenced Log Egg Prices") +
theme_minimal()

This series appears much closer to weak stationarity. The ACF plot suggests candidate MA orders q = 0, 2, 4 and seasonal Q = 1.
Additionally, both first-order differencing (d = 1) and seasonal differencing (D = 1) have been applied to the data.
Let’s check the PACF plot to check the AR parameters.
ggPacf(seasonal_diff) +
ggtitle("PACF of Seasonally Differenced Log Egg Prices") +
theme_minimal()

The PACF plot suggests candidate AR orders p = 0, 2, 3 and seasonal P = 0, 1, 2.
Parameter Search
###### Load Required Libraries ######
library(knitr)
library(kableExtra)
###### Initialize Variables ######
# Create an empty list to store models
SARIMA_fit <- list()
# Create an empty matrix to store results
results_matrix <- matrix(rep(NA, 9 * 4), nrow = 4)
# Initialize index for matrix row
cc <- 1
###### Loop through SARIMA Model Parameters ######
# Parameters: p = 2; d = 1; q = 2,4; P = 1,2; D = 1; Q = 1; s = 12
p <- 2
for (q in c(2, 4)) {
for (P in 1:2) {
Q <- 1
# Fit SARIMA model
model <- Arima(ts_log_eggs, order = c(p, 1, q), seasonal = c(P, 1, Q))
# Store model in list
SARIMA_fit[[cc]] <- model
# Store model parameters and AIC/BIC/AICc values in matrix
results_matrix[cc, ] <- c(p, 1, q, P, 1, Q, model$aic, model$bic, model$aicc)
# Increment row index
cc <- cc + 1
}
}
###### Convert Matrix to Data Frame ######
results_df <- as.data.frame(na.omit(results_matrix))
colnames(results_df) <- c("p", "d", "q", "P", "D", "Q", "AIC", "BIC", "AICc")
###### Find the Row with the Minimum AIC ######
highlight_row <- which.min(results_df$AIC)
###### Generate kable Table with Highlighting for the Row with the Minimum AIC ######
knitr::kable(results_df, align = 'c', caption = "Comparison of SARIMA Models") %>%
kable_styling(full_width = FALSE, position = "center") %>%
row_spec(highlight_row, bold = TRUE, background = "#FFFF99")

| p | d | q | P | D | Q | AIC | BIC | AICc |
|---|---|---|---|---|---|---|---|---|
| 2 | 1 | 2 | 1 | 1 | 1 | -1383.800 | -1353.930 | -1383.584 |
| 2 | 1 | 2 | 2 | 1 | 1 | -1381.869 | -1347.732 | -1381.591 |
| 2 | 1 | 4 | 1 | 1 | 1 | -1382.705 | -1344.300 | -1382.356 |
| 2 | 1 | 4 | 2 | 1 | 1 | -1379.765 | -1337.093 | -1379.339 |
This shows that the lowest AIC and BIC belong to ARIMA(2,1,2)x(1,1,1)[12].
I’m going to check with different parameters.
###### Define SARIMA Model Comparison Function ######
SARIMA.c = function(p1, p2, q1, q2, P1, P2, Q1, Q2, data) {
# Set differencing orders and seasonal period
d = 1
D = 1
s = 12
# Initialize result storage
i = 1
results_matrix = matrix(NA, nrow = 100, ncol = 9)
# Iterate over parameter ranges
for (p in p1:p2) {
for (q in q1:q2) {
for (P in P1:P2) {
for (Q in Q1:Q2) {
# Apply parameter constraint to avoid overfitting
if (p + d + q + P + D + Q <= 10) {
# Fit SARIMA model
model <- Arima(data, order = c(p, d, q), seasonal = c(P, D, Q))
# Store parameters and model selection criteria
results_matrix[i, ] <- c(p, d, q, P, D, Q, model$aic, model$bic, model$aicc)
i = i + 1
}
}
}
}
}
# Convert results to data frame and label columns
results_df = as.data.frame(na.omit(results_matrix))
names(results_df) = c("p", "d", "q", "P", "D", "Q", "AIC", "BIC", "AICc")
return(results_df)
}
###### Run SARIMA Model Comparison ######
output = SARIMA.c(p1 = 0, p2 = 3, q1 = 0, q2 = 3, P1 = 0, P2 = 2, Q1 = 0, Q2 = 1, data = ts_log_eggs)
###### Identify Models with Minimum AIC and BIC ######
minaic = output[which.min(output$AIC), ]
minbic = output[which.min(output$BIC), ]
###### Display Best Models Based on AIC and BIC ######
minaic

   p d q P D Q       AIC      BIC      AICc
90 3 1 2 2 1 1 -1388.925 -1350.52 -1388.577

minbic

   p d q P D Q       AIC       BIC      AICc
14 0 1 2 0 1 1 -1387.615 -1370.546 -1387.538
This results in ARIMA(3,1,2)x(2,1,1)[12] and ARIMA(0,1,2)x(0,1,1)[12].
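For reference, the same kind of seasonal grid can be run with base R's arima() and an explicit seasonal = list(...), avoiding extra packages; the series and orders below are toy assumptions for illustration.

```r
# Seasonal-order grid with base R's arima() on a simulated monthly toy series.
set.seed(11)
x <- ts(as.numeric(arima.sim(list(ar = 0.5), n = 240)) +
        3 * rep(sin(2 * pi * (1:12) / 12), 20),
        frequency = 12)

fits <- expand.grid(p = 0:1, q = 0:1, P = 0:1, Q = 0:1)
fits$AIC <- apply(fits, 1, function(r) {
  tryCatch(
    AIC(arima(x, order = c(r["p"], 0, r["q"]),
              seasonal = list(order = c(r["P"], 1, r["Q"]), period = 12))),
    error = function(e) NA_real_)
})
fits[which.min(fits$AIC), ]
```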
Checking with auto.arima()
auto.arima(ts_log_eggs)

Series: ts_log_eggs
ARIMA(3,1,2)(2,0,0)[12]
Coefficients:
ar1 ar2 ar3 ma1 ma2 sar1 sar2
1.3245 -0.5085 -0.1461 -1.4046 0.7374 0.1132 0.1949
s.e. 0.2002 0.1666 0.0498 0.1976 0.1426 0.0515 0.0475
sigma^2 = 0.00446: log likelihood = 696.69
AIC=-1377.38 AICc=-1377.11 BIC=-1343.06
auto.arima() selects ARIMA(3,1,2)x(2,0,0)[12].
Model Diagnostics
We have several choices for our models. Let’s check the model diagnostics to find the best model.
The Models chosen by the above analysis:
- ARIMA(2,1,1)x(1,1,1)[12]
- ARIMA(3,1,2)x(2,0,0)[12]
- ARIMA(3,1,2)x(2,1,1)[12]
- ARIMA(0,1,2)x(0,1,1)[12]
ARIMA(2,1,1)x(1,1,1)[12]
###### Set Seed for Reproducibility ######
set.seed(345)
###### Capture ARIMA Model Output for Diagnostics ######
# Model diagnostics for ARIMA(2,1,1)x(1,1,1)[12]
model_output <- capture.output(sarima(ts_log_eggs, 2, 1, 1, 1, 1, 1, 12))
# Find the line numbers dynamically based on a keyword
start_line <- grep("Coefficients", model_output) # Locate where coefficient details start
end_line <- length(model_output) # Last line of output
# Print the relevant section automatically
cat(model_output[start_line:end_line], sep = "\n")

Coefficients:
Estimate SE t.value p.value
ar1 0.1891 0.1973 0.9581 0.3385
ar2 0.1903 0.0461 4.1308 0.0000
ma1 -0.3120 0.1984 -1.5722 0.1165
sar1 -0.0472 0.0494 -0.9561 0.3395
sma1 -0.9449 0.0298 -31.7492 0.0000
sigma^2 estimated as 0.003918233 on 522 degrees of freedom
AIC = -2.628597 AICc = -2.628378 BIC = -2.580014
The Residual Plot shows nearly consistent fluctuation around zero, suggesting that the residuals are nearly stationary with a constant mean and finite variance over time.
The Autocorrelation Function (ACF) of the residuals reveals no significant autocorrelations indicating a good model.
The Q-Q Plot indicates that the residuals follow a near-normal distribution, with minor deviations at the tails, which is typical in time series data.
The Ljung-Box Test p-values remain mostly above the 0.05 significance level, implying a nearly good fit.
Coefficient Significance: the ar1, ma1, and sar1 coefficients are not significant, which casts doubt on this model being a good fit.
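The t-values and p-values in these coefficient tables can be reproduced for any base-R arima() fit from the coefficient covariance matrix; the sketch below uses simulated AR(2) data (a toy assumption, not the egg-price series).

```r
# Reproducing coefficient t- and p-values from a base-R arima() fit (toy AR(2) data).
set.seed(3)
x <- arima.sim(list(ar = c(0.6, -0.3)), n = 500)
fit <- arima(x, order = c(2, 0, 0))

est   <- coef(fit)
se    <- sqrt(diag(fit$var.coef))  # standard errors
t_val <- est / se
p_val <- 2 * pnorm(-abs(t_val))    # large-sample normal approximation

round(cbind(Estimate = est, SE = se, t.value = t_val, p.value = p_val), 4)
```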
ARIMA(3,1,2)x(2,0,0)[12]
###### Load Necessary Libraries ######
library(astsa)
###### Set Seed for Reproducibility ######
set.seed(345)
###### Capture ARIMA Model Output for Diagnostics ######
# Model diagnostics for ARIMA(3,1,2)x(2,0,0)[12]
model_output <- capture.output(sarima(ts_log_eggs, 3, 1, 2, 2, 0, 0, 12))
# Find the line numbers dynamically based on a keyword
start_line <- grep("Coefficients", model_output) # Locate where coefficient details start
end_line <- length(model_output) # Last line of output
# Print the relevant section automatically
cat(model_output[start_line:end_line], sep = "\n")

Coefficients:
Estimate SE t.value p.value
ar1 1.3388 0.2060 6.4980 0.0000
ar2 -0.5177 0.1765 -2.9326 0.0035
ar3 -0.1483 0.0482 -3.0741 0.0022
ma1 -1.4196 0.2042 -6.9528 0.0000
ma2 0.7467 0.1519 4.9173 0.0000
sar1 0.1100 0.0529 2.0797 0.0380
sar2 0.1924 0.0483 3.9806 0.0001
constant 0.0029 0.0040 0.7198 0.4719
sigma^2 estimated as 0.004398357 on 531 degrees of freedom
AIC = -2.552677 AICc = -2.552173 BIC = -2.481049
The Residual Plot shows nearly consistent fluctuation around zero, suggesting that the residuals are nearly stationary with a constant mean and finite variance over time.
The Autocorrelation Function (ACF) of the residuals reveals no significant autocorrelations indicating a good model.
The Q-Q Plot indicates that the residuals follow a near-normal distribution, with minor deviations at the tails, which is typical in time series data.
The Ljung-Box Test p-values remain mostly above the 0.05 significance level, implying a nearly good fit.
Coefficient Significance: all AR, MA, and seasonal AR coefficients are significant; only the constant term (p = 0.47) is not.
ARIMA(3,1,2)x(2,1,1)[12]
###### Set Seed for Reproducibility ######
set.seed(345)
###### Capture ARIMA Model Output for Diagnostics ######
# Model diagnostics for ARIMA(3,1,2)x(2,1,1)[12]
model_output <- capture.output(sarima(ts_log_eggs, 3, 1, 2, 2, 1, 1, 12))
# Find the line numbers dynamically based on a keyword
start_line <- grep("Coefficients", model_output) # Locate where coefficient details start
end_line <- length(model_output) # Last line of output
# Print the relevant section automatically
cat(model_output[start_line:end_line], sep = "\n")

Coefficients:
Estimate SE t.value p.value
ar1 1.4825 0.1562 9.4897 0.0000
ar2 -0.3041 0.1712 -1.7763 0.0763
ar3 -0.2368 0.0438 -5.4086 0.0000
ma1 -1.6279 0.1594 -10.2147 0.0000
ma2 0.6625 0.1591 4.1649 0.0000
sar1 0.0087 0.0535 0.1626 0.8709
sar2 0.0168 0.0517 0.3254 0.7450
sma1 -0.9540 0.0333 -28.6119 0.0000
sigma^2 estimated as 0.003834145 on 519 degrees of freedom
AIC = -2.635531 AICc = -2.635004 BIC = -2.562657
Here the sar1 and sar2 coefficients are not significant (p = 0.87 and 0.75), although this model attains the lowest AIC of the candidates so far.
ARIMA(0,1,2)x(0,1,1)[12]
###### Set Seed for Reproducibility ######
set.seed(345)
###### Capture ARIMA Model Output for Diagnostics ######
# Model diagnostics for ARIMA(0,1,2)x(0,1,1)[12]
model_output <- capture.output(sarima(ts_log_eggs, 0, 1, 2, 0, 1, 1, 12))
# Find the line numbers dynamically based on a keyword
start_line <- grep("Coefficients", model_output) # Locate where coefficient details start
end_line <- length(model_output) # Last line of output
# Print the relevant section automatically
cat(model_output[start_line:end_line], sep = "\n")

Coefficients:
Estimate SE t.value p.value
ma1 -0.1281 0.0438 -2.9248 0.0036
ma2 0.1673 0.0422 3.9686 0.0001
sma1 -0.9572 0.0290 -32.9526 0.0000
sigma^2 estimated as 0.003917763 on 524 degrees of freedom
AIC = -2.633045 AICc = -2.632958 BIC = -2.600656
Comparing the model diagnostics and AIC values across the candidates, we choose ARIMA(3,1,2)x(2,1,1)[12].
Fit the model
###### Fit the Final ARIMA Model ######
# Based on diagnostics and AIC values, ARIMA(3,1,2)x(2,1,1)[12] was selected
fit <- Arima(ts_log_eggs, order = c(3, 1, 2), seasonal = c(2, 1, 1))
###### Display Model Summary ######
summary(fit)

Series: ts_log_eggs
ARIMA(3,1,2)(2,1,1)[12]
Coefficients:
ar1 ar2 ar3 ma1 ma2 sar1 sar2 sma1
1.4825 -0.3041 -0.2368 -1.6279 0.6625 0.0087 0.0168 -0.9540
s.e. 0.1562 0.1712 0.0438 0.1594 0.1591 0.0535 0.0517 0.0333
sigma^2 = 0.003893: log likelihood = 703.46
AIC=-1388.92 AICc=-1388.58 BIC=-1350.52
Training set error measures:
ME RMSE MAE MPE MAPE MASE
Training set 0.001965082 0.06117061 0.04498489 -12.92247 66.23099 0.2957145
ACF1
Training set 0.005535244
Forecasting
###### Forecast the Next 120 Periods ######
forecast_result <- forecast(fit, h = 120)
###### Display Forecast Accuracy ######
accuracy(forecast_result)

ME RMSE MAE MPE MAPE MASE
Training set 0.001965082 0.06117061 0.04498489 -12.92247 66.23099 0.2957145
ACF1
Training set 0.005535244
###### Plot the Forecast ######
autoplot(forecast_result) +
labs(title = "ARIMA(3,1,2)x(2,1,1)[12] Forecast",
x = "Time",
y = "Predicted Log Egg Prices") +
theme_minimal()

Benchmark Methods
###### Forecast Comparison Plot ######
autoplot(ts_log_eggs) +
autolayer(meanf(ts_log_eggs, h = 120), series = "Mean", PI = FALSE) +
autolayer(naive(ts_log_eggs, h = 120), series = "Naïve", PI = FALSE) +
autolayer(snaive(ts_log_eggs, h = 120), series = "SNaïve", PI = FALSE) +
autolayer(rwf(ts_log_eggs, h = 120, drift = TRUE), series = "Drift", PI = FALSE) +
autolayer(forecast(fit, h = 120), series = "SARIMA Fit", PI = FALSE) +
guides(colour = guide_legend(title = "Forecast Methods")) +
ggtitle("Forecast Comparison for Log Transformed Egg Prices") +
ylab("Log Egg Prices") +
xlab("Time")

Compared with the benchmark methods, it is clear that the SARIMA model captures the trend in the egg prices.
Original Vs. Fitted Values
For the log transformed data
###### Forecast the Next 365 Periods ######
forecast_result <- forecast(fit, h = 365)
###### Display Forecast Accuracy ######
accuracy(forecast_result)

ME RMSE MAE MPE MAPE MASE
Training set 0.001965082 0.06117061 0.04498489 -12.92247 66.23099 0.2957145
ACF1
Training set 0.005535244
###### Plot Original Data vs Fitted Values ######
autoplot(ts_log_eggs, series = "Original Data") +
autolayer(forecast_result$fitted, series = "Fitted Values", PI = FALSE) +
labs(title = "Original Data vs Fitted Values",
x = "Time",
y = "Log Egg Prices") +
theme_minimal() +
guides(colour = guide_legend(title = "Series"))

For the original data
###### Forecast the Next 365 Periods ######
forecast_result <- forecast(fit, h = 365)
###### Display Forecast Accuracy ######
accuracy(forecast_result)

ME RMSE MAE MPE MAPE MASE
Training set 0.001965082 0.06117061 0.04498489 -12.92247 66.23099 0.2957145
ACF1
Training set 0.005535244
###### Plot Original Egg Prices vs Fitted Values ######
# Ensure Date is in Date format
eggs_data$Date <- as.Date(eggs_data$Date)
# Convert to time series format
start_year <- as.numeric(format(min(eggs_data$Date), "%Y"))
start_month <- as.numeric(format(min(eggs_data$Date), "%m"))
egg_prices_ts <- ts(eggs_data$egg_prices, start = c(start_year, start_month), frequency = 12)
fitted_ts <- ts(exp(forecast_result$fitted), start = c(start_year, start_month), frequency = 12)
# Plot
autoplot(egg_prices_ts, series = "Original Data") +
autolayer(fitted_ts, series = "Fitted Values", PI = FALSE) +
labs(title = "Original Egg Prices vs Fitted Values",
x = "Time",
y = "Egg Prices") +
theme_minimal() +
guides(colour = guide_legend(title = "Series"))

The fitted values track the original prices very closely.
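One caveat on the exp() back-transform used above: exponentiating a log-scale mean forecast gives the median, not the mean, on the original scale. Under a lognormal assumption the mean needs a bias correction exp(mu + sigma^2/2); with the small residual variance here the two nearly coincide, which is why the naive back-transform looks so good. The mu value below is a hypothetical point forecast.

```r
# Median vs bias-corrected mean when back-transforming a log forecast.
mu     <- 1.5     # hypothetical log-scale point forecast
sigma2 <- 0.0039  # residual variance of the fitted SARIMA (from the summary above)

median_fc <- exp(mu)                # naive back-transform = forecast median
mean_fc   <- exp(mu + sigma2 / 2)   # lognormal bias-corrected mean

cat("median:", round(median_fc, 4), " mean:", round(mean_fc, 4), "\n")
cat("relative difference:", signif(mean_fc / median_fc - 1, 3), "\n")
```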
Prophet model
Remember the prophet model works better when there is strong seasonality.
###### Load Required Libraries ######
library(prophet)
library(dplyr)
###### Prepare the Dataset for Prophet ######
# Convert the log-transformed egg prices data for Prophet
eggs_data_prophet <- eggs_data %>%
select(Date, log_egg_prices) %>%
rename(ds = Date, y = log_egg_prices)
###### Initialize and Fit the Prophet Model ######
prophet_model <- prophet(eggs_data_prophet)
###### Extend the Forecast Horizon by 365 Days ######
future_dates <- make_future_dataframe(prophet_model, periods = 365)
###### Generate Future Predictions ######
predictions <- predict(prophet_model, future_dates)
###### Visualize the Forecast ######
plot(prophet_model, predictions) +
labs(title = "Prophet Forecast for Log Egg Prices",
x = "Date",
y = "Log Egg Prices")

Example 2:
US monthly natural gas consumption: 2000 - 2019. Units: Billion Cubic Feet
- TS plot and the ACF
###### Plot US Monthly Natural Gas Consumption ######
autoplot(USgas) +
ggtitle("US Monthly Natural Gas Consumption") +
xlab("Year") +
ylab("Gas Consumption") +
theme_minimal()

###### ACF Plot Using Base R ######
acf(USgas, main = "ACF of US Monthly Natural Gas Consumption")

###### ACF Plot Using ggAcf ######
ggAcf(USgas, lag.max = 40) +
ggtitle("ACF of US Monthly Natural Gas Consumption (40 Lags)")

You can see the difference between ggAcf() and acf() here: ggAcf() labels lags in periods, so the seasonal lags appear at \(12, 24, 36, \ldots\), whereas acf() labels lags in years for a monthly series, so the seasonal lags appear at \(1, 2, \ldots\) and the ordinary lags at fractional values.
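The lag-unit difference is easy to verify in base R: for a monthly ts, acf() stores lags as fractions of a year, so the first seasonal lag (12 months) appears as 1. The toy white-noise series is an assumption for illustration.

```r
# acf() lag units for a monthly ts: lags are reported in YEARS.
set.seed(2)
x <- ts(rnorm(120), frequency = 12)

a <- acf(x, lag.max = 24, plot = FALSE)
head(a$lag[, 1, 1], 4)   # 0, 1/12, 2/12, 3/12 -- fractions of a year
a$lag[13, 1, 1]          # the 12-month seasonal lag is stored as 1
```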
- Lag Plot
########## Interactive plots#########
ts_lags(USgas, lags = c(12, 24, 36, 48))
######################################
gglagplot(USgas, do.lines = FALSE, set.lags = c(12, 24, 36, 48)) +
ggtitle("US monthly natural gas consumption")

Apparent seasonal correlation can be seen in the positive correlation at seasonal lags 12, 24, 36, 48.
- Decomposing
We are using decompose() function to look at the seasonal component.
dec2 <- decompose(USgas, type = "additive")  # decompose() takes a single type
plot(dec2)

- Differencing
str(USgas) # this is already a time series

Time-Series [1:238] from 2000 to 2020: 2510 2331 2051 1783 1633 ...

USgas %>% diff() %>% ggtsdisplay()                    # first ordinary differencing
USgas %>% diff(lag = 12) %>% ggtsdisplay()            # first seasonal differencing
USgas %>% diff(lag = 12) %>% diff() %>% ggtsdisplay() # do both

You can see that the first ordinary differencing is not enough: the seasonal correlation is still clearly visible. Therefore, we need to proceed with seasonal differencing.
Second set of plots: only seasonal differencing. It looks like there isn't much correlation left after seasonal differencing alone, so D = 1 and d = 0 is one option worth keeping.
Third set of plots: both seasonal and ordinary differencing. This looks more stationary.
Here, from the ACF plot: q = 0, 1, 2, 3 and Q = 1, 2; from the PACF plot: p = 0, 1, 2 and P = 1, 2.
I include 0 as a candidate order because with more parameters, some can become insignificant in the presence of the others.
- Model Fitting
######################## Check for different combinations ########
#write a funtion
SARIMA.c = function(p1, p2, q1, q2, P1, P2, Q1, Q2, data) {
  # Fix differencing orders and seasonal period
  d = 1
  D = 1
  s = 12
  i = 1
  ls = matrix(rep(NA, 9 * 35), nrow = 35)
  for (p in p1:p2) {
    for (q in q1:q2) {
      for (P in P1:P2) {
        for (Q in Q1:Q2) {
          # Constrain the total model order to avoid overfitting
          if (p + d + q + P + D + Q <= 9) {
            model <- Arima(data, order = c(p - 1, d, q - 1), seasonal = c(P - 1, D, Q - 1))
            ls[i, ] = c(p - 1, d, q - 1, P - 1, D, Q - 1, model$aic, model$bic, model$aicc)
            i = i + 1
          }
        }
      }
    }
  }
  # Drop unused rows and label columns
  temp = as.data.frame(na.omit(ls))
  names(temp) = c("p", "d", "q", "P", "D", "Q", "AIC", "BIC", "AICc")
  temp
}

# ACF plot: q = 0,1,2,3; Q = 1,2; PACF plot: p = 0,1,2; P = 1,2; D = 1 and d = 0,1
output=SARIMA.c(p1=1,p2=3,q1=1,q2=4,P1=1,P2=3,Q1=1,Q2=3,data=USgas)
#output
knitr::kable(output)

| p | d | q | P | D | Q | AIC | BIC | AICc |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 1 | 0 | 2892.631 | 2896.047 | 2892.649 |
| 0 | 1 | 0 | 0 | 1 | 1 | 2805.310 | 2812.142 | 2805.364 |
| 0 | 1 | 0 | 0 | 1 | 2 | 2805.986 | 2816.234 | 2806.094 |
| 0 | 1 | 0 | 1 | 1 | 0 | 2866.355 | 2873.187 | 2866.409 |
| 0 | 1 | 0 | 1 | 1 | 1 | 2806.594 | 2816.842 | 2806.702 |
| 0 | 1 | 0 | 1 | 1 | 2 | 2805.556 | 2819.221 | 2805.738 |
| 0 | 1 | 0 | 2 | 1 | 0 | 2828.098 | 2838.346 | 2828.206 |
| 0 | 1 | 0 | 2 | 1 | 1 | 2798.563 | 2812.227 | 2798.745 |
| 0 | 1 | 1 | 0 | 1 | 0 | 2854.973 | 2861.805 | 2855.027 |
| 0 | 1 | 1 | 0 | 1 | 1 | 2769.757 | 2780.006 | 2769.866 |
| 0 | 1 | 1 | 0 | 1 | 2 | 2769.439 | 2783.103 | 2769.621 |
| 0 | 1 | 1 | 1 | 1 | 0 | 2831.546 | 2841.794 | 2831.655 |
| 0 | 1 | 1 | 1 | 1 | 1 | 2770.495 | 2784.159 | 2770.677 |
| 0 | 1 | 1 | 2 | 1 | 0 | 2797.594 | 2811.259 | 2797.776 |
| 0 | 1 | 2 | 0 | 1 | 0 | 2831.956 | 2842.204 | 2832.064 |
| 0 | 1 | 2 | 0 | 1 | 1 | 2754.034 | 2767.699 | 2754.216 |
| 0 | 1 | 2 | 1 | 1 | 0 | 2809.699 | 2823.364 | 2809.881 |
| 0 | 1 | 3 | 0 | 1 | 0 | 2833.595 | 2847.260 | 2833.777 |
| 1 | 1 | 0 | 0 | 1 | 0 | 2878.839 | 2885.672 | 2878.893 |
| 1 | 1 | 0 | 0 | 1 | 1 | 2789.262 | 2799.511 | 2789.371 |
| 1 | 1 | 0 | 0 | 1 | 2 | 2789.799 | 2803.464 | 2789.981 |
| 1 | 1 | 0 | 1 | 1 | 0 | 2852.613 | 2862.862 | 2852.722 |
| 1 | 1 | 0 | 1 | 1 | 1 | 2790.451 | 2804.116 | 2790.633 |
| 1 | 1 | 0 | 2 | 1 | 0 | 2816.768 | 2830.433 | 2816.950 |
| 1 | 1 | 1 | 0 | 1 | 0 | 2831.655 | 2841.903 | 2831.763 |
| 1 | 1 | 1 | 0 | 1 | 1 | 2752.973 | 2766.637 | 2753.155 |
| 1 | 1 | 1 | 1 | 1 | 0 | 2809.715 | 2823.379 | 2809.897 |
| 1 | 1 | 2 | 0 | 1 | 0 | 2833.110 | 2846.775 | 2833.292 |
| 2 | 1 | 0 | 0 | 1 | 0 | 2845.213 | 2855.462 | 2845.322 |
| 2 | 1 | 0 | 0 | 1 | 1 | 2768.783 | 2782.447 | 2768.964 |
| 2 | 1 | 0 | 1 | 1 | 0 | 2822.087 | 2835.751 | 2822.268 |
| 2 | 1 | 1 | 0 | 1 | 0 | 2846.528 | 2860.193 | 2846.710 |
output[which.min(output$AIC), ]

   p d q P D Q      AIC      BIC     AICc
26 1 1 1 0 1 1 2752.973 2766.637 2753.155

output[which.min(output$BIC), ]

   p d q P D Q      AIC      BIC     AICc
26 1 1 1 0 1 1 2752.973 2766.637 2753.155

output[which.min(output$AICc), ]

   p d q P D Q      AIC      BIC     AICc
26 1 1 1 0 1 1 2752.973 2766.637 2753.155
ARIMA(1,1,1)x(0,1,1)[12] is the model suggested above.
###### Set Seed for Reproducibility ######
set.seed(123)
###### Fit SARIMA Model and Capture Output ######
model_output <- capture.output(sarima(USgas, 1, 1, 1, 0, 1, 1, 12))
###### Find the Line Numbers Dynamically Based on a Keyword ######
start_line <- grep("Coefficients", model_output) # Locate where coefficient details start
end_line <- length(model_output) # Last line of output
###### Print the Relevant Section Automatically ######
cat(model_output[start_line:end_line], sep = "\n")
Coefficients:
Estimate SE t.value p.value
ar1 0.4112 0.0758 5.4229 0
ma1 -0.9072 0.0370 -24.4854 0
sma1 -0.8072 0.0484 -16.6757 0
sigma^2 estimated as 10925.36 on 222 degrees of freedom
AIC = 12.23543 AICc = 12.23592 BIC = 12.29616
The Standard Residuals plot looks good, showing approximate stationarity with a nearly constant mean and variance.
The Autocorrelation Function (ACF) plot of the residuals shows almost no correlation, indicating the model has captured the structure in the data and what remains is essentially white noise. This suggests a good model fit.
The Quantile-Quantile (Q-Q) plot demonstrates near-normality of the residuals.
The Ljung-Box test results, however, show some p-values below the 0.05 (5% significance) threshold, indicating that some significant residual correlation remains.
$ttable: all coefficients are significant.
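Note that sarima() and Arima() report information criteria on different scales: sarima() divides by the effective sample size, while Arima() reports the unscaled values. A small sketch, again assuming n = 225 (an inference, not stated in the output), reconciles the two sets of numbers:

```r
# sarima() prints AIC/n and BIC/n; multiplying by the assumed effective
# sample size recovers the unscaled criteria that Arima() reports
n <- 225
n * 12.23543  # ~2752.97, the AIC printed by Arima() for the same model
n * 12.29616  # ~2766.64, the corresponding BIC
```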
- Model Fitting
fit <- Arima(USgas, order=c(1,1,1), seasonal=c(0,1,1))
summary(fit)
Series: USgas
ARIMA(1,1,1)(0,1,1)[12]
Coefficients:
ar1 ma1 sma1
0.4112 -0.9072 -0.8072
s.e. 0.0758 0.0370 0.0484
sigma^2 = 11073: log likelihood = -1372.49
AIC=2752.97 AICc=2753.15 BIC=2766.64
Training set error measures:
ME RMSE MAE MPE MAPE MASE
Training set 6.818919 101.6306 75.78126 0.1440502 3.564759 0.6519017
ACF1
Training set -0.003844906
- Forecasting
fit %>% forecast(h=36) %>% autoplot() # next 3 years
sarima.for(USgas, 36, 1,1,1,0,1,1,12)
$pred
Jan Feb Mar Apr May Jun Jul Aug
2019
2020 3445.352 3024.296 2910.105 2403.679 2254.036 2256.930 2475.676 2479.352
2021 3505.170 3089.649 2977.734 2472.244 2322.987 2326.038 2544.849 2548.552
2022 3574.389 3158.868 3046.953 2541.463 2392.206 2395.257 2614.068 2617.771
Sep Oct Nov Dec
2019 2694.431 3158.320
2020 2285.920 2406.314 2708.053 3204.677
2021 2355.131 2475.530 2777.271 3273.895
2022 2424.350 2544.749
$se
Jan Feb Mar Apr May Jun Jul Aug
2019
2020 121.1862 123.2756 124.7305 125.9540 127.0832 128.1697 129.2337 130.2838
2021 141.6944 143.2182 144.6274 145.9839 147.3121 148.6220 149.9179 151.2016
2022 163.6161 165.3688 167.0047 168.5853 170.1355 171.6652 173.1787 174.6781
Sep Oct Nov Dec
2019 104.5300 117.0569
2020 131.3232 132.3536 137.3318 139.8920
2021 152.4741 153.7359 158.8085 161.5838
2022 176.1643 177.6380
- Compare with Benchmark methods
fit <- Arima(USgas, order=c(1,1,1), seasonal=c(0,1,1))
autoplot(USgas) +
autolayer(meanf(USgas, h=36),
series="Mean", PI=FALSE) +
autolayer(naive(USgas, h=36),
series="Naïve", PI=FALSE) +
autolayer(snaive(USgas, h=36),
series="SNaïve", PI=FALSE)+
autolayer(rwf(USgas, h=36, drift=TRUE),
series="Drift", PI=FALSE)+
autolayer(forecast(fit,36),
series="fit",PI=FALSE) +
guides(colour=guide_legend(title="Forecast"))
The fitted model looks slightly better than the seasonal naive forecast because it captures the trend in the series.
f2 <- snaive(USgas, h=36)
accuracy(f2)
                   ME    RMSE      MAE      MPE     MAPE MASE     ACF1
Training set 37.31549 149.597 116.2465 1.498393 5.479521    1 0.468312
summary(fit)
Series: USgas
ARIMA(1,1,1)(0,1,1)[12]
Coefficients:
ar1 ma1 sma1
0.4112 -0.9072 -0.8072
s.e. 0.0758 0.0370 0.0484
sigma^2 = 11073: log likelihood = -1372.49
AIC=2752.97 AICc=2753.15 BIC=2766.64
Training set error measures:
ME RMSE MAE MPE MAPE MASE
Training set 6.818919 101.6306 75.78126 0.1440502 3.564759 0.6519017
ACF1
Training set -0.003844906
The fitted model's error measures are much lower than those of the seasonal naive benchmark.
Therefore, the fitted model performs well.
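The MASE column ties the two tables together: it is the model's in-sample MAE scaled by the seasonal naive MAE, so it can be reproduced by hand from the numbers above (a quick base-R check, not part of the original output):

```r
# MASE = model MAE / seasonal-naive MAE (in-sample), using the values
# reported by accuracy() above
mae_fit    <- 75.78126   # ARIMA(1,1,1)(0,1,1)[12] training MAE
mae_snaive <- 116.2465   # seasonal naive training MAE
mae_fit / mae_snaive     # ~0.6519, matching the reported MASE
```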
Example 3: Monthly Airline Passenger Numbers 1949-1960
This follows the same procedure as we discussed before.
###### Load and Inspect AirPassengers Data ######
x <- AirPassengers
str(x)
 Time-Series [1:144] from 1949 to 1961: 112 118 132 129 121 135 148 148 136 119 ...
###### Plot Original Data ######
autoplot(x) +
ggtitle("AirPassengers Data") +
xlab("Year") +
ylab("Number of Passengers") +
theme_minimal()
###### Apply Log Transformation and Differencing ######
lx <- log(x)
dlx <- diff(lx)
ddlx <- diff(dlx, lag = 12)
###### Plot Transformed Series ######
plot.ts(cbind(Original = x, Log_Transformed = lx, First_Diff = dlx, Seasonal_Diff = ddlx),
        main = "AirPassengers Data Transformations")
###### Seasonal Plots ######
par(mfrow = c(2, 1))
monthplot(dlx, main = "Month Plot of First Differenced Log Data")
monthplot(ddlx, main = "Month Plot of Seasonal Differenced Data")
###### ACF and PACF Plots ######
acf2(ddlx, 50)
      [,1]  [,2]  [,3] [,4] [,5] [,6]  [,7] [,8] [,9] [,10] [,11] [,12] [,13]
ACF -0.34 0.11 -0.20 0.02 0.06 0.03 -0.06 0.00 0.18 -0.08 0.06 -0.39 0.15
PACF -0.34 -0.01 -0.19 -0.13 0.03 0.03 -0.06 -0.02 0.23 0.04 0.05 -0.34 -0.11
[,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24] [,25]
ACF -0.06 0.15 -0.14 0.07 0.02 -0.01 -0.12 0.04 -0.09 0.22 -0.02 -0.1
PACF -0.08 -0.02 -0.14 0.03 0.11 -0.01 -0.17 0.13 -0.07 0.14 -0.07 -0.1
[,26] [,27] [,28] [,29] [,30] [,31] [,32] [,33] [,34] [,35] [,36] [,37]
ACF 0.05 -0.03 0.05 -0.02 -0.05 -0.05 0.20 -0.12 0.08 -0.15 -0.01 0.05
PACF -0.01 0.04 -0.09 0.05 0.00 -0.10 -0.02 0.01 -0.02 0.02 -0.16 -0.03
[,38] [,39] [,40] [,41] [,42] [,43] [,44] [,45] [,46] [,47] [,48] [,49]
ACF 0.03 -0.02 -0.03 -0.07 0.10 -0.09 0.03 -0.04 -0.04 0.11 -0.05 0.11
PACF 0.01 0.05 -0.08 -0.17 0.07 -0.10 -0.06 -0.03 -0.12 -0.01 -0.05 0.09
[,50]
ACF -0.02
PACF 0.13
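Each differencing step shortens the series: the ordinary difference drops one observation and the seasonal difference drops twelve more. A quick base-R check on the built-in AirPassengers data:

```r
# Count the observations lost at each transformation step
x    <- log(AirPassengers)   # 144 monthly observations
dlx  <- diff(x)              # ordinary difference: 143 left
ddlx <- diff(dlx, lag = 12)  # seasonal difference: 131 left
c(length(x), length(dlx), length(ddlx))  # 144 143 131
```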
log(x) %>% diff() %>% ggtsdisplay()
log(x) %>% diff(12) %>% ggtsdisplay()
log(x) %>% diff() %>% diff(12) %>% ggtsdisplay() # this is better
# q=1,3; Q=1; p=1,3; P=1; D=1; d=0,1
output=SARIMA.c(p1=1,p2=4,q1=1,q2=4,P1=1,P2=2,Q1=1,Q2=2,data=log(x))
#output
#knitr::kable(output)
output[which.min(output$AIC),]
  p d q P D Q       AIC       BIC      AICc
6 0 1 1 0 1 1 -483.3991 -474.7735 -483.2101
output[which.min(output$BIC),]
  p d q P D Q       AIC       BIC      AICc
6 0 1 1 0 1 1 -483.3991 -474.7735 -483.2101
output[which.min(output$AICc),]
  p d q P D Q       AIC       BIC      AICc
6 0 1 1 0 1 1 -483.3991 -474.7735 -483.2101
###### Fit SARIMA Models and Display Diagnostics ######
# ARIMA(1,1,1)x(0,1,1)[12]
model_output11 <- capture.output(sarima(lx, 1, 1, 1, 0, 1, 1, 12))
cat(model_output11[grep("Coefficients", model_output11):length(model_output11)], sep = "\n")
Coefficients:
Estimate SE t.value p.value
ar1 0.1960 0.2475 0.7921 0.4298
ma1 -0.5784 0.2132 -2.7127 0.0076
sma1 -0.5643 0.0747 -7.5544 0.0000
sigma^2 estimated as 0.001341097 on 128 degrees of freedom
AIC = -3.678622 AICc = -3.677179 BIC = -3.59083
# ARIMA(0,1,1)x(0,1,1)[12] - Best Model
model_output12 <- capture.output(sarima(lx, 0, 1, 1, 0, 1, 1, 12))
cat(model_output12[grep("Coefficients", model_output12):length(model_output12)], sep = "\n")
Coefficients:
Estimate SE t.value p.value
ma1 -0.4018 0.0896 -4.4825 0
sma1 -0.5569 0.0731 -7.6190 0
sigma^2 estimated as 0.001348035 on 129 degrees of freedom
AIC = -3.690069 AICc = -3.689354 BIC = -3.624225
# ARIMA(1,1,0)x(0,1,1)[12]
model_output13 <- capture.output(sarima(lx, 1, 1, 0, 0, 1, 1, 12))
cat(model_output13[grep("Coefficients", model_output13):length(model_output13)], sep = "\n")
Coefficients:
Estimate SE t.value p.value
ar1 -0.3395 0.0822 -4.1295 1e-04
sma1 -0.5619 0.0748 -7.5109 0e+00
sigma^2 estimated as 0.00136738 on 129 degrees of freedom
AIC = -3.675493 AICc = -3.674777 BIC = -3.609649
Forecasting
#chosen model : forecasting
sarima.for(lx, 12, 0,1,1, 0,1,1,12)$pred
Jan Feb Mar Apr May Jun Jul Aug
1961 6.110186 6.053775 6.171715 6.199300 6.232556 6.368779 6.507294 6.502906
Sep Oct Nov Dec
1961 6.324698 6.209008 6.063487 6.168025
$se
Jan Feb Mar Apr May Jun
1961 0.03671562 0.04278291 0.04809072 0.05286830 0.05724856 0.06131670
Jul Aug Sep Oct Nov Dec
1961 0.06513124 0.06873441 0.07215787 0.07542612 0.07855851 0.08157070
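Because the model was fitted to the log series, the forecasts above are on the log scale. They can be mapped back to passenger counts with exp(); adding half the squared standard error before exponentiating gives the approximately mean-unbiased forecast under log-normal errors. A sketch using the January 1961 values from the output above:

```r
# Back-transform a log-scale forecast to the original scale
pred <- 6.110186    # Jan 1961 point forecast (log scale), from above
se   <- 0.03671562  # its standard error
exp(pred)             # ~450 passengers: the median forecast
exp(pred + se^2 / 2)  # bias-adjusted mean forecast (slightly larger)
```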
Benchmark: Seasonal Naive
fit=Arima(lx,order=c(0,1,1),seasonal=c(0,1,1))
autoplot(lx) +
autolayer(meanf(lx, h=12),
series="Mean", PI=FALSE) +
autolayer(naive(lx, h=12),
series="Naïve", PI=FALSE) +
autolayer(snaive(lx, h=12),
series="SNaïve", PI=FALSE)+
autolayer(rwf(lx, drift=TRUE, h=12),
series="Drift", PI=FALSE)+
autolayer(forecast(fit,12),
series="fit",PI=FALSE) +
guides(colour=guide_legend(title="Forecast"))
###### Apply Seasonal Naive Forecast to Log-Transformed AirPassengers Data ######
f2 <- snaive(lx, h = 36)
###### Check Residuals of the Seasonal Naive Model ######
checkresiduals(f2)
Ljung-Box test
data: Residuals from Seasonal naive method
Q* = 267.65, df = 24, p-value < 2.2e-16
Model df: 0. Total lags used: 24
###### Evaluate Forecast Accuracy ######
accuracy(f2)
                    ME     RMSE      MAE      MPE     MAPE MASE      ACF1
Training set 0.1198215 0.134642 0.121042 2.159718 2.182627    1 0.7136876
###### Display Model Summary ######
summary(f2)
Forecast method: Seasonal naive method
Model Information:
Call: snaive(y = lx, h = 36)
Residual sd: 0.1346
Error measures:
ME RMSE MAE MPE MAPE MASE ACF1
Training set 0.1198215 0.134642 0.121042 2.159718 2.182627 1 0.7136876
Forecasts:
Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
Jan 1961 6.033086 5.860536 6.205637 5.769193 6.296980
Feb 1961 5.968708 5.796157 6.141258 5.704814 6.232601
Mar 1961 6.037871 5.865320 6.210422 5.773977 6.301764
Apr 1961 6.133398 5.960847 6.305949 5.869505 6.397291
May 1961 6.156979 5.984428 6.329530 5.893086 6.420872
Jun 1961 6.282267 6.109716 6.454817 6.018373 6.546160
Jul 1961 6.432940 6.260389 6.605491 6.169047 6.696834
Aug 1961 6.406880 6.234329 6.579431 6.142987 6.670773
Sep 1961 6.230481 6.057931 6.403032 5.966588 6.494375
Oct 1961 6.133398 5.960847 6.305949 5.869505 6.397291
Nov 1961 5.966147 5.793596 6.138697 5.702253 6.230040
Dec 1961 6.068426 5.895875 6.240976 5.804532 6.332319
Jan 1962 6.033086 5.789063 6.277110 5.659885 6.406288
Feb 1962 5.968708 5.724684 6.212731 5.595506 6.341909
Mar 1962 6.037871 5.793847 6.281894 5.664669 6.411073
Apr 1962 6.133398 5.889375 6.377422 5.760196 6.506600
May 1962 6.156979 5.912956 6.401002 5.783777 6.530181
Jun 1962 6.282267 6.038243 6.526290 5.909065 6.655468
Jul 1962 6.432940 6.188917 6.676964 6.059738 6.806142
Aug 1962 6.406880 6.162857 6.650903 6.033678 6.780082
Sep 1962 6.230481 5.986458 6.474505 5.857280 6.603683
Oct 1962 6.133398 5.889375 6.377422 5.760196 6.506600
Nov 1962 5.966147 5.722123 6.210170 5.592945 6.339348
Dec 1962 6.068426 5.824402 6.312449 5.695224 6.441627
Jan 1963 6.033086 5.734220 6.331953 5.576009 6.490163
Feb 1963 5.968708 5.669841 6.267574 5.511631 6.425784
Mar 1963 6.037871 5.739004 6.336737 5.580794 6.494948
Apr 1963 6.133398 5.834532 6.432265 5.676321 6.590475
May 1963 6.156979 5.858112 6.455845 5.699902 6.614056
Jun 1963 6.282267 5.983400 6.581133 5.825190 6.739344
Jul 1963 6.432940 6.134074 6.731807 5.975863 6.890017
Aug 1963 6.406880 6.108013 6.705746 5.949803 6.863957
Sep 1963 6.230481 5.931615 6.529348 5.773405 6.687558
Oct 1963 6.133398 5.834532 6.432265 5.676321 6.590475
Nov 1963 5.966147 5.667280 6.265013 5.509070 6.423224
Dec 1963 6.068426 5.769559 6.367292 5.611349 6.525502
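The seasonal naive point forecasts above are easy to verify by hand: each month's forecast is simply the observation from the same month one year earlier, which is also why the 1961, 1962, and 1963 forecasts repeat. A base-R check of the January 1961 value:

```r
# The snaive forecast for Jan 1961 is just log(AirPassengers) for Jan 1960
lx <- log(AirPassengers)
jan1960 <- window(lx, start = c(1960, 1), end = c(1960, 1))
as.numeric(jan1960)  # ~6.033086, i.e. log(417), matching the table above
```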
Example 4 : Corticosteroid drug sales in Australia
We will try to forecast monthly corticosteroid drug sales in Australia. These are known as H02 drugs under the Anatomical Therapeutic Chemical classification scheme.
###### Plot Original H02 Sales Data ######
autoplot(h02) +
ggtitle("H02 Sales Data") +
xlab("Year") +
ylab("Sales (Million Scripts)") +
theme_minimal()
###### Log-Transform H02 Sales Data ######
lh02 <- log(h02)
###### Plot Original and Log-Transformed H02 Sales Data ######
cbind("H02 Sales (Million Scripts)" = h02,
"Log H02 Sales" = lh02) %>%
autoplot(facets = TRUE) +
ggtitle("Original and Log-Transformed H02 Sales") +
xlab("Year") +
ylab("") +
theme_minimal()
Data from July 1991 to June 2008 are plotted. There is a small increase in the variance with the level, so we take logarithms to stabilise the variance.
The data are strongly seasonal and obviously non-stationary, so seasonal differencing will be used. The seasonally differenced data are shown below. It is not clear at this point whether we should do another difference or not.
The last few observations appear to be different (more variable) from the earlier data. This may be due to the fact that data are sometimes revised when earlier sales are reported late.
###### Plot ACF of Log-Transformed H02 Sales Data ######
ggAcf(lh02) +
ggtitle("ACF of Log-Transformed H02 Sales")
We can see high correlation and seasonal correlation. Therefore, let's try seasonal differencing.
###### Apply Seasonal Differencing (D=1) to Log-Transformed H02 Sales Data ######
# Seasonal differencing using pipe
lh02 %>%
diff(lag = 12) %>%
ggtsdisplay(xlab = "Year",
main = "Seasonally Differenced H02 Scripts")
# Alternative approach using explicit variable
sdf <- diff(lh02, lag = 12)
###### Plot ACF and PACF for Seasonally Differenced Data ######
ggAcf(sdf) +
ggtitle("ACF of Seasonally Differenced H02 Sales")
ggPacf(sdf) +
ggtitle("PACF of Seasonally Differenced H02 Sales")
From the ACF plot above, you can see that the seasonal correlation has been removed, but ordinary correlation remains.
Let’s do an ordinary differencing on top of seasonal differencing.
lh02 %>% diff(lag=12) %>%diff() %>%
ggtsdisplay(xlab="Year",
main="Seasonally & Ordinary differenced H02 scripts")
In the plots of the differenced data, there are spikes in the PACF at lag 12 and almost at lag 24, and a spike at seasonal lag 12 in the ACF. The pattern in the ACF is not indicative of any simple model.
So q=0,1; p=1,2; Q=0,1,2; P=0,1; d=1; D=1; s=12.
We fit these models and compare their AIC, BIC, and AICc values.
# Write a function to grid-search SARIMA orders
SARIMA.c=function(p1,p2,q1,q2,P1,P2,Q1,Q2,data){
#K=(p2+1)*(q2+1)*(P2+1)*(Q2+1)
d=1
D=1
s=12
i=1
ls=matrix(rep(NA,9*24),nrow=24) # pre-allocate: one row per candidate model
for (p in p1:p2)
{
for(q in q1:q2)
{
for(P in P1:P2)
{
for(Q in Q1:Q2)
{
if(p+d+q+P+D+Q<=10)
{
model<- Arima(data,order=c(p,d,q),seasonal=c(P,D,Q))
ls[i,]= c(p,d,q,P,D,Q,model$aic,model$bic,model$aicc)
i=i+1
#print(i)
}
}
}
}
}
temp= as.data.frame(ls)
names(temp)= c("p","d","q","P","D","Q","AIC","BIC","AICc")
temp
}
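The nested loops above enumerate every (p, q, P, Q) combination with d = D = 1 fixed. A more compact base-R way to build the same candidate list is expand.grid(); this is a sketch of an alternative, not code from the lab:

```r
# Enumerate the same candidate orders with expand.grid (d = D = 1, s = 12)
grid <- expand.grid(p = 1:2, q = 0:1, P = 0:1, Q = 0:2)
# apply the same total-order cap used inside SARIMA.c
grid <- grid[grid$p + 1 + grid$q + grid$P + 1 + grid$Q <= 10, ]
nrow(grid)  # 24 candidate models, one per row of the ls matrix above
```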
output=SARIMA.c(p1=1,p2=2,q1=0,q2=1,P1=0,P2=1,Q1=0,Q2=2,data=lh02)
#SARIMA.c(2,3,1,2,1,2,1,3,data=lh02)
minaic=output[which.min(output$AIC),]
minbic=output[which.min(output$BIC),]
minaic
   p d q P D Q       AIC      BIC      AICc
21 2 1 1 0 1 2 -484.5096 -464.996 -484.0531
minbic
   p d q P D Q       AIC       BIC      AICc
14 2 1 0 0 1 1 -482.7753 -469.7662 -482.5602
The models suggested are ARIMA(2,1,1)x(0,1,2)[12] and ARIMA(2,1,0)x(0,1,1)[12].
Another way of doing this:
d=1
D=1
s=12
ARIMA_fit <- list()
## set counter
cc <- 1
for (p in 2:3) # p-1 = 1,2
{
for(q in 1:2) # q-1 = 0,1
{
for(P in 1:2) # P-1 = 0,1
{
for(Q in 1:3) # Q-1 = 0,1,2
{
if(p-1+d+q-1+P-1+D+Q-1<=10)
{
ARIMA_fit[[cc]] <- Arima(lh02,order=c(p-1,d,q-1),seasonal=c(P-1,D,Q-1))
cc <- cc + 1
}
}
}
}
}
# get AIC values for model evaluation
ARIMA_AIC <- sapply( ARIMA_fit, function(x) AIC(x))
## model with lowest AIC is the best
ARIMA_fit[[which(ARIMA_AIC == min(ARIMA_AIC))]]
Series: lh02
ARIMA(2,1,1)(0,1,2)[12]
Coefficients:
ar1 ar2 ma1 sma1 sma2
-1.1358 -0.5753 0.3683 -0.5318 -0.1817
s.e. 0.1608 0.0965 0.1884 0.0838 0.0881
sigma^2 = 0.004278: log likelihood = 248.25
AIC=-484.51 AICc=-484.05 BIC=-465
ARIMA_BIC <- sapply(ARIMA_fit , function(x) BIC(x))
## model with lowest BIC is the best
ARIMA_fit[[which(ARIMA_BIC == min(ARIMA_BIC))]]
Series: lh02
ARIMA(2,1,0)(0,1,1)[12]
Coefficients:
ar1 ar2 sma1
-0.8491 -0.4207 -0.6401
s.e. 0.0712 0.0714 0.0694
sigma^2 = 0.004387: log likelihood = 245.39
AIC=-482.78 AICc=-482.56 BIC=-469.77
#ARIMA_fit[[which(ARIMA_AIC == min(ARIMA_AIC)) & which(ARIMA_BIC == min(ARIMA_BIC)) ]]
The models suggested are ARIMA(2,1,1)x(0,1,2)[12] and ARIMA(2,1,0)x(0,1,1)[12].
(fit <- Arima(h02, order=c(2,1,1), seasonal=c(0,1,2), lambda=0))
Series: h02
ARIMA(2,1,1)(0,1,2)[12]
Box Cox transformation: lambda= 0
Coefficients:
ar1 ar2 ma1 sma1 sma2
-1.1358 -0.5753 0.3683 -0.5318 -0.1817
s.e. 0.1608 0.0965 0.1884 0.0838 0.0881
sigma^2 = 0.004278: log likelihood = 248.25
AIC=-484.51 AICc=-484.05 BIC=-465
checkresiduals(fit, lag=36)
Ljung-Box test
data: Residuals from ARIMA(2,1,1)(0,1,2)[12]
Q* = 51.096, df = 31, p-value = 0.01298
Model df: 5. Total lags used: 36
#or
(fit2 <- Arima(lh02, order=c(2,1,0), seasonal=c(0,1,1)))
Series: lh02
ARIMA(2,1,0)(0,1,1)[12]
Coefficients:
ar1 ar2 sma1
-0.8491 -0.4207 -0.6401
s.e. 0.0712 0.0714 0.0694
sigma^2 = 0.004387: log likelihood = 245.39
AIC=-482.78 AICc=-482.56 BIC=-469.77
checkresiduals(fit2, lag=36)
Ljung-Box test
data: Residuals from ARIMA(2,1,0)(0,1,1)[12]
Q* = 59.282, df = 33, p-value = 0.003322
Model df: 3. Total lags used: 36
In the textbook Forecasting: Principles and Practice (https://otexts.com/fpp2/seasonal-arima.html), the authors found that the best model is ARIMA(3,0,1)(0,1,2)[12].
(fit <- Arima(h02, order=c(3,0,1), seasonal=c(0,1,2), lambda=0))
Series: h02
ARIMA(3,0,1)(0,1,2)[12]
Box Cox transformation: lambda= 0
Coefficients:
ar1 ar2 ar3 ma1 sma1 sma2
-0.1603 0.5481 0.5678 0.3827 -0.5222 -0.1768
s.e. 0.1636 0.0878 0.0942 0.1895 0.0861 0.0872
sigma^2 = 0.004278: log likelihood = 250.04
AIC=-486.08 AICc=-485.48 BIC=-463.28
checkresiduals(fit, lag=36)
Ljung-Box test
data: Residuals from ARIMA(3,0,1)(0,1,2)[12]
Q* = 50.712, df = 30, p-value = 0.01045
Model df: 6. Total lags used: 36
Model Diagnostic: There are a few significant spikes in the ACF, and the model fails the Ljung-Box test. The model can still be used for forecasting, but the prediction intervals may not be accurate due to the correlated residuals.
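The Ljung-Box statistic reported by checkresiduals() is available directly in base R as Box.test(). Since h02 ships with the fpp2 package, the sketch below illustrates the call on the built-in AirPassengers series instead; its first-differenced log values retain strong seasonal autocorrelation and therefore fail the test decisively:

```r
# Ljung-Box test with base R's Box.test(); a small p-value means
# autocorrelation remains in the tested series
res <- diff(log(AirPassengers))  # strongly seasonal series
bt  <- Box.test(res, lag = 24, type = "Ljung-Box")
bt$p.value  # far below 0.05
```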
###### Capture SARIMA Model Outputs ######
# SARIMA(2,1,1)(0,1,2)[12]
model_output1 <- capture.output(sarima(lh02, 2, 1, 1, 0, 1, 2, 12))
cat(model_output1[grep("Coefficients", model_output1):length(model_output1)], sep = "\n")
Coefficients:
Estimate SE t.value p.value
ar1 -1.1358 0.1608 -7.0622 0.0000
ar2 -0.5753 0.0965 -5.9630 0.0000
ma1 0.3683 0.1884 1.9546 0.0521
sma1 -0.5318 0.0838 -6.3462 0.0000
sma2 -0.1817 0.0881 -2.0622 0.0406
sigma^2 estimated as 0.004165779 on 186 degrees of freedom
AIC = -2.5367 AICc = -2.535002 BIC = -2.434534
# SARIMA(2,1,0)(0,1,1)[12]
model_output2 <- capture.output(sarima(lh02, 2, 1, 0, 0, 1, 1, 12))
cat(model_output2[grep("Coefficients", model_output2):length(model_output2)], sep = "\n")
Coefficients:
Estimate SE t.value p.value
ar1 -0.8491 0.0712 -11.9258 0
ar2 -0.4207 0.0714 -5.8926 0
sma1 -0.6401 0.0694 -9.2245 0
sigma^2 estimated as 0.004318266 on 188 degrees of freedom
AIC = -2.527619 AICc = -2.526947 BIC = -2.459509
# SARIMA(3,0,1)(0,1,2)[12]
model_output3 <- capture.output(sarima(lh02, 3, 0, 1, 0, 1, 2, 12))
cat(model_output3[grep("Coefficients", model_output3):length(model_output3)], sep = "\n")
Coefficients:
Estimate SE t.value p.value
ar1 -0.2653 0.1691 -1.5687 0.1184
ar2 0.5011 0.0813 6.1665 0.0000
ar3 0.5394 0.0848 6.3626 0.0000
ma1 0.4572 0.1904 2.4013 0.0173
sma1 -0.5031 0.0847 -5.9385 0.0000
sma2 -0.2030 0.0871 -2.3305 0.0209
constant 0.0038 0.0009 4.4331 0.0000
sigma^2 estimated as 0.004023383 on 185 degrees of freedom
AIC = -2.552016 AICc = -2.548846 BIC = -2.416287
The Standard Residuals plots appear good for all three models, showing approximate stationarity with a nearly constant mean and variance.
The Autocorrelation Function (ACF) plots show almost no residual correlation for all three models.
The Quantile-Quantile (Q-Q) plots demonstrate near-normality of the residuals.
The Ljung-Box test results show p-values above the 0.05 (5% significance) threshold, indicating no significant residual correlation and suggesting a well-fitted model.
$ttable:
- sarima(lh02,2,1,1,0,1,2,12): all coefficients significant except ma1; AIC = -484.51
- sarima(lh02,2,1,0,0,1,1,12): all coefficients significant; AIC = -482.78
- sarima(lh02,3,0,1,0,1,2,12): all coefficients significant; AIC = -489.99
Next we will try using the automatic ARIMA algorithm.
auto.arima(lh02)
Series: lh02
ARIMA(2,1,1)(0,1,2)[12]
Coefficients:
ar1 ar2 ma1 sma1 sma2
-1.1358 -0.5753 0.3683 -0.5318 -0.1817
s.e. 0.1608 0.0965 0.1884 0.0838 0.0881
sigma^2 = 0.004278: log likelihood = 248.25
AIC=-484.51 AICc=-484.05 BIC=-465
Running auto.arima() with all arguments left at their default values leads to an ARIMA(2,1,1)(0,1,2)[12] model. However, as the checkresiduals() output below shows, this model still fails the Ljung-Box test at 36 lags. Sometimes it is simply not possible to find a model that passes all of the tests.
(fit4 <- Arima(lh02, order=c(2,1,1), seasonal=c(0,1,2)))
Series: lh02
ARIMA(2,1,1)(0,1,2)[12]
Coefficients:
ar1 ar2 ma1 sma1 sma2
-1.1358 -0.5753 0.3683 -0.5318 -0.1817
s.e. 0.1608 0.0965 0.1884 0.0838 0.0881
sigma^2 = 0.004278: log likelihood = 248.25
AIC=-484.51 AICc=-484.05 BIC=-465
checkresiduals(fit4, lag=36)
Ljung-Box test
data: Residuals from ARIMA(2,1,1)(0,1,2)[12]
Q* = 51.096, df = 31, p-value = 0.01298
Model df: 5. Total lags used: 36
It appears, however, that sarima(lh02,3,0,1,0,1,2,12) is a better model.
fit<-h02 %>%
Arima(order=c(3,0,1), seasonal=c(0,1,2), lambda=0)
autoplot(h02) +
autolayer(meanf(h02, h=36),
series="Mean", PI=FALSE) +
autolayer(naive(h02, h=36),
series="Naïve", PI=FALSE) +
autolayer(snaive(h02, h=36),
series="SNaïve", PI=FALSE)+
autolayer(rwf(h02, h=36, drift=TRUE),
series="Drift", PI=FALSE)+
autolayer(forecast(fit,36),
series="fit",PI=FALSE) +
guides(colour=guide_legend(title="Forecast"))
f2 <- snaive(h02, h=36)
#checkresiduals(f2)
accuracy(f2)
                     ME       RMSE        MAE      MPE    MAPE MASE      ACF1
Training set 0.03265974 0.07415608 0.06061829 4.345164 8.02116    1 0.2700878
summary(fit)
Series: .
ARIMA(3,0,1)(0,1,2)[12]
Box Cox transformation: lambda= 0
Coefficients:
ar1 ar2 ar3 ma1 sma1 sma2
-0.1603 0.5481 0.5678 0.3827 -0.5222 -0.1768
s.e. 0.1636 0.0878 0.0942 0.1895 0.0861 0.0872
sigma^2 = 0.004278: log likelihood = 250.04
AIC=-486.08 AICc=-485.48 BIC=-463.28
Training set error measures:
ME RMSE MAE MPE MAPE MASE
Training set 0.001736226 0.0493495 0.03596218 0.2515195 4.621055 0.5932563
ACF1
Training set -0.01964279
Accuracy measures are smaller for the fitted model, indicating a better model than the seasonal naive benchmark.
Forecasts from the ARIMA(3,0,1)(0,1,2)[12] model:
h02 %>%
Arima(order=c(3,0,1), seasonal=c(0,1,2), lambda=0) %>%
forecast() %>%
autoplot() +
ylab("H02 sales (million scripts)") + xlab("Year")